Abstract

Frequent episode discovery is a popular framework for pattern discovery in event streams. An
episode is a partially ordered set of nodes with each node associated with an event type. Efficient
(and separate) algorithms exist for episode discovery when the associated partial order is total (serial
episode) and trivial (parallel episode). In this paper, we propose efficient algorithms for discovering
frequent episodes with general partial orders. These algorithms can be easily specialized to discover
serial or parallel episodes. Also, the algorithms are flexible enough to be specialized for mining in
the space of certain interesting subclasses of partial orders. We point out that there is an inherent
combinatorial explosion in frequent partial order mining and, most importantly, frequency alone is not
a sufficient measure of interestingness. We propose a new interestingness measure for general partial
order episodes and a discovery method based on this measure, for filtering out uninteresting partial
orders. Simulations demonstrate the effectiveness of our algorithms.

I. Introduction 

Frequent episode discovery [12] is a popular framework for discovering temporal patterns 
in symbolic time series data, with applications in several domains like manufacturing [6], [16], 
telecommunication [12], WWW [9], biology [2], [14], finance [13], intrusion detection [10], [17], 
text mining [5] etc. The data in this framework is a single long time-ordered stream of events 
and each temporal pattern (called an episode) is essentially a small, partially ordered collection 
of nodes, with each node associated with a symbol (called event-type). The partial order in the 
episode constrains the time-order in which events should appear in the data, in order for the
events to constitute an occurrence of the episode. Patterns with a total order on their nodes are 
called serial episodes, while those with an empty partial order are called parallel episodes [12]. 
The task is to unearth all episodes whose frequency in the data exceeds a user-defined threshold. 

Currently, separate algorithms exist in the literature for discovering frequent serial and parallel 
episodes in data streams [3], [6], [12], [14], while no algorithms are available for the case of 
episodes with general partial orders. Related work can be found in the context of sequential 
patterns [1], [4], [11], [15] where the data consists of multiple sequences and the sequential 
pattern is a small partially ordered collection of symbols. A sequential pattern is considered 
frequent if there are enough sequences (in the data) in which the pattern occurs at least once. By
contrast, in frequent episode discovery, we are looking for patterns that repeat often in a single 
long stream of events. This makes the computational task quite different from that in sequential 
patterns. 

In this paper, we develop algorithms for discovering frequent episodes with general partial 
order constraints over their nodes. We restrict our attention to a subclass of patterns called 
injective episodes, where an event-type cannot appear more than once in a given episode. This 
facilitates the design of efficient algorithms with no restriction whatsoever on the partial orders 
of episodes. Further, our algorithms can handle the usual expiry time constraints for episode 
occurrences (which limit the time-spans of valid occurrences to some user-defined maximum 
value). Our algorithms can be easily specialized to discover either only frequent serial episodes
or only frequent parallel episodes. Moreover, we can also specialize the method to focus the
discovery process on certain classes of partial order episodes that satisfy what we call the
maximal subepisode property (serial episodes and parallel episodes are specific examples of
classes that obey this property).

As we point out here, one of the difficulties in efficient discovery of general partial orders is 
that there is an inherent combinatorial explosion in the number of frequent episodes of any given 
size. This is because, for any partial order episode with n nodes, there are an exponential number 
of subepisodes, also of size n, all of which would occur at least as often as the episode. (Note 
that this problem does not arise in, e.g., frequent serial episode discovery because an n-node 
serial episode cannot have any n-node serial subepisode). Thus, frequency alone is insufficient as 
a measure of interestingness for episodes with general partial orders. To tackle this, we propose 
a new measure called bidirectional evidence, which captures some notion of entropy of relative
frequencies of pairs of events occurring in either order in the observed occurrences of an episode. 
The mining procedure now requires a user-defined threshold on bidirectional evidence in addition 
to the usual frequency threshold. We demonstrate the utility of our algorithms through extensive 
empirical studies. 
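To make the combinatorial explosion concrete, the following minimal sketch enumerates every partial order contained in the total order of a small serial episode; each one is a subepisode of the same size and therefore occurs at least as often. The function names `is_transitive` and `same_size_subepisodes` are illustrative assumptions and are not part of the algorithms proposed in this paper.

```python
from itertools import combinations


def is_transitive(rel):
    """True if (a, b) and (b, d) in rel always imply (a, d) in rel."""
    return all((a, d) in rel
               for (a, b) in rel
               for (c, d) in rel
               if b == c)


def same_size_subepisodes(chain):
    """List all partial orders on the event types of a serial episode
    that are contained in its total order (hypothetical helper)."""
    # All ordered pairs implied by the serial episode chain[0] -> chain[1] -> ...
    pairs = [(chain[i], chain[j])
             for i in range(len(chain))
             for j in range(i + 1, len(chain))]
    found = []
    for r in range(len(pairs) + 1):
        for subset in combinations(pairs, r):
            rel = set(subset)
            # Any subset of a total order is irreflexive and asymmetric,
            # so only transitivity needs to be checked.
            if is_transitive(rel):
                found.append(rel)
    return found


# The serial episode (A -> B -> C) has 7 same-size subepisodes,
# including itself and the parallel episode (A B C).
print(len(same_size_subepisodes("ABC")))
```

Even for three nodes, a single frequent serial episode drags along six other frequent 3-node episodes, which is why a frequency threshold alone cannot prune the output.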

The paper is organized as follows. In Sec. II we briefly review the frequent episodes formalism
and define injective episodes. Sec. III describes the finite state automata (and their associated
properties) for tracking occurrences of injective episodes. Algorithms for counting frequencies
of partial order episodes are described in Sec. IV. The candidate generation is described in
Sec. V. Sec. VII-A describes our new interestingness measure. We present simulation results in
Sec. VIII and conclude in Sec. IX.



II. Episodes in event streams 

The data, referred to as an event sequence, is denoted by $D = ((E_1, t_1), (E_2, t_2), \ldots, (E_n, t_n))$,
where $n$ is the number of events in the data stream. In each tuple $(E_i, t_i)$, $E_i$ denotes the event
type and $t_i$ the time of occurrence of the event. The event types $E_i$ take values from a finite set,
$\mathcal{E}$. The sequence is ordered so that $t_i \le t_{i+1}$ for all $i = 1, 2, \ldots$. The following is an example
sequence with 11 events:

$$((A, 2), (B, 3), (A, 3), (A, 7), (C, 8), (B, 9), (D, 11), (C, 12), (A, 13), (B, 14), (C, 15)) \qquad (1)$$

Definition 1: [12] An $N$-node episode $\alpha$ is a tuple $(V_\alpha, <_\alpha, g_\alpha)$, where $V_\alpha = \{v_1, v_2, \ldots, v_N\}$
denotes a collection of nodes, $<_\alpha$ is a strict partial order¹ on $V_\alpha$ and $g_\alpha : V_\alpha \to \mathcal{E}$ is a map that
associates each node in the episode with an event-type (out of the alphabet $\mathcal{E}$).
When $<_\alpha$ is a total order, $\alpha$ is referred to as a serial episode and when $<_\alpha$ is empty $\alpha$ is referred
to as a parallel episode. In general, episodes can be neither serial nor parallel. We denote episodes
using a simple graphical notation. For example, consider a 3-node episode $\alpha = (V_\alpha, <_\alpha, g_\alpha)$,
where $v_1 <_\alpha v_2$ and $v_1 <_\alpha v_3$, and with $g_\alpha(v_1) = B$, $g_\alpha(v_2) = A$ and $g_\alpha(v_3) = C$. We denote
this episode as $(B \to (A\,C))$, implying that $B$ is followed by $A$ and $C$ in any order.

¹ A strict partial order is a relation which is irreflexive, asymmetric and transitive.

Definition 2: [12] Given a data stream $((E_1, t_1), \ldots, (E_n, t_n))$ and an episode $\alpha = (V_\alpha, <_\alpha, g_\alpha)$,
an occurrence of $\alpha$ is a map $h : V_\alpha \to \{1, \ldots, n\}$ such that $g_\alpha(v) = E_{h(v)}$ for all $v \in V_\alpha$,
and for all $v, w \in V_\alpha$ with $v <_\alpha w$ we have $t_{h(v)} < t_{h(w)}$.

For example, $((B, 3), (A, 7), (C, 8))$ and $((B, 9), (C, 12), (A, 13))$ constitute occurrences of $(B \to (A\,C))$
in the event sequence (1), while $((B, 3), (A, 3), (C, 8))$ is not a valid occurrence since $B$
does not occur before $A$.
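The occurrence test of Definition 2 can be sketched in a few lines. The sketch below assumes an injective episode (so nodes can be identified with their event types), uses 0-based positions into the event list, and introduces the hypothetical names `is_occurrence`, `SEQ1` and `ORDER` purely for illustration.

```python
def is_occurrence(events, order, mapping):
    """Check Definition 2 for an injective episode.

    events  : list of (event_type, time) tuples
    order   : set of (earlier_type, later_type) pairs (the partial order)
    mapping : dict sending each event type of the episode to a 0-based
              position in `events` (the map h)
    """
    # Each node must be mapped to an event of the right type.
    for etype, idx in mapping.items():
        if events[idx][0] != etype:
            return False
    # Every ordered pair must appear in strictly increasing time.
    return all(events[mapping[a]][1] < events[mapping[b]][1]
               for a, b in order)


# Event sequence (1) and the episode (B -> (A C)).
SEQ1 = [("A", 2), ("B", 3), ("A", 3), ("A", 7), ("C", 8), ("B", 9),
        ("D", 11), ("C", 12), ("A", 13), ("B", 14), ("C", 15)]
ORDER = {("B", "A"), ("B", "C")}

print(is_occurrence(SEQ1, ORDER, {"B": 1, "A": 3, "C": 4}))  # (B,3),(A,7),(C,8)
print(is_occurrence(SEQ1, ORDER, {"B": 1, "A": 2, "C": 4}))  # (B,3),(A,3),(C,8)
```

The second call fails only because the time comparison is strict: $B$ and $A$ share the time stamp 3.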

Given any $N$-node episode $\alpha$, it is sometimes useful to represent an occurrence $h$ of $\alpha$
as a vector of integers $[h(1)\ h(2) \ldots h(N)]$, where $h(i) < h(i+1)$, $i = 1, \ldots, (N-1)$. For
example, in sequence (1), the occurrence corresponding to the subsequence $((B, 3), (A, 7), (C, 8))$
is associated with the vector $[2\ 4\ 5]$ (since $(B, 3)$, $(A, 7)$ and $(C, 8)$ are the second, fourth and
fifth events in (1) respectively).

Consider an $N$-node episode $\alpha$ and the set $H_\alpha$ of occurrences of $\alpha$ in event sequence $D$.
The occurrences in $H_\alpha$ can be arranged according to the lexicographic ordering of the vectors
$[h(1), \ldots, h(N)]$, $h \in H_\alpha$.

Definition 3: [8] The lexicographic order, $<_*$, on the set $H_\alpha$ of occurrences of an $N$-node
episode $\alpha$ in an event sequence $D$ can be defined as follows: given two different occurrences
$h_1$ and $h_2$ of $\alpha$ in $D$, we have $h_1 <_* h_2$ iff the least $i$ for which $h_1(i) \ne h_2(i)$ is such that
$h_1(i) < h_2(i)$.

Definition 4: [12] Episode $\beta = (V_\beta, <_\beta, g_\beta)$ is said to be a subepisode of $\alpha = (V_\alpha, <_\alpha, g_\alpha)$
(denoted $\beta \preceq \alpha$) if there exists a 1-1 map $f_{\beta\alpha} : V_\beta \to V_\alpha$ such that (i) $g_\beta(v) = g_\alpha(f_{\beta\alpha}(v))$
for all $v \in V_\beta$, and (ii) for all $v, w \in V_\beta$ with $v <_\beta w$, we have $f_{\beta\alpha}(v) <_\alpha f_{\beta\alpha}(w)$ in $V_\alpha$.
In other words, for $\beta$ to be a subepisode of $\alpha$, all event-types of $\beta$ must also be in $\alpha$, and the
order among the event-types in $\beta$ must also hold in $\alpha$. Thus, $(B \to A)$, $(B \to C)$ and $(A\,C)$
are the 2-node subepisodes of $(B \to (A\,C))$. We note here that if $\beta \preceq \alpha$, then every occurrence
of $\alpha$ contains an occurrence of $\beta$.

Given an event sequence, the data mining task here is to discover all frequent episodes, i.e.,
those episodes whose frequencies exceed a given threshold. Frequency is some measure of
how often an episode occurs in the data stream. The frequency of episodes can be defined in
more than one way [7], [12]. In this paper, we consider the non-overlapped occurrences-based
frequency measure for episodes [7]. Informally, two occurrences of an episode are said to be
non-overlapped if no event corresponding to one occurrence appears in between events of the
other. The frequency of an episode is the size of the largest set of non-overlapped occurrences
for that episode in the given data stream.

Definition 5: [7] Consider a data stream (event sequence) $D$ and an $N$-node episode $\alpha$.
Two occurrences $h_1$ and $h_2$ of $\alpha$ are said to be non-overlapped in $D$ if either $t_{h_1(N)} < t_{h_2(1)}$ or
$t_{h_2(N)} < t_{h_1(1)}$. A set of occurrences is said to be non-overlapped if every pair of occurrences in
the set is non-overlapped. The cardinality of the largest set of non-overlapped occurrences of $\alpha$
in $D$ is referred to as the non-overlapped frequency of $\alpha$ in $D$.
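As a small illustration of Definition 5, the sketch below tests whether two occurrences, given as sorted position vectors, are non-overlapped. The helper name `non_overlapped` and the 0-based indexing are assumptions made for this example only.

```python
def non_overlapped(times, h1, h2):
    """Definition 5 for two occurrences given as sorted position vectors.

    times  : list of event times t_1, ..., t_n (non-decreasing)
    h1, h2 : lists of 0-based positions, each sorted in increasing order
    """
    # One occurrence must end strictly before the other one starts.
    return times[h1[-1]] < times[h2[0]] or times[h2[-1]] < times[h1[0]]


# Event times of sequence (1).
TIMES = [2, 3, 3, 7, 8, 9, 11, 12, 13, 14, 15]

first = [1, 3, 4]     # (B,3), (A,7), (C,8)
second = [5, 7, 8]    # (B,9), (C,12), (A,13)
third = [1, 8, 10]    # (B,3), (A,13), (C,15)

print(non_overlapped(TIMES, first, second))
print(non_overlapped(TIMES, first, third))
```

The first pair is non-overlapped because the first occurrence ends at time 8 and the second starts at time 9; the second pair overlaps because both occurrences start with the event $(B, 3)$.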

A. Injective Episodes 

In this paper, we consider a sub-class of episodes called injective episodes. An episode
$\alpha = (V_\alpha, <_\alpha, g_\alpha)$ is said to be injective if $g_\alpha$ is an injective (or 1-1) map. For example, the
episode $(B \to (A\,C))$ is an injective episode, while $(B \to (A\,C) \to B)$ is not. Thus, an injective
episode is simply a subset of event-types (out of the alphabet $\mathcal{E}$) with a partial order defined
over it. This subset, which we will denote by $X_\alpha$, is the same as the range of $g_\alpha$. The partial order
that is induced over $X_\alpha$ by $<_\alpha$ is denoted by $R_\alpha$. It is often much simpler to view an injective
episode $\alpha$ in terms of the partially ordered set $(X_\alpha, R_\alpha)$ that is associated with it. From now on,
unless otherwise stated, when we say episode we mean an injective episode.

In this paper, we will use either $(V_\alpha, <_\alpha, g_\alpha)$ or $(X_\alpha, R_\alpha)$ to denote episode $\alpha$, depending on
the context. Although $(X_\alpha, R_\alpha)$ is simpler, in some contexts, e.g., when referring to episode
occurrences, the $(V_\alpha, <_\alpha, g_\alpha)$ notation comes in handy. However, there can be multiple $(V_\alpha, <_\alpha, g_\alpha)$
representations for the same underlying pattern under Definition 1. Consider, for example, two
3-node episodes, $\alpha_1 = (V_1, <_{\alpha_1}, g_{\alpha_1})$ and $\alpha_2 = (V_2, <_{\alpha_2}, g_{\alpha_2})$, defined as: (i) $V_1 = \{v_1, v_2, v_3\}$
with $v_1 <_{\alpha_1} v_2$, $v_1 <_{\alpha_1} v_3$ and $g_{\alpha_1}(v_1) = B$, $g_{\alpha_1}(v_2) = A$, $g_{\alpha_1}(v_3) = C$, and (ii) $V_2 = \{v_1, v_2, v_3\}$
with $v_2 <_{\alpha_2} v_1$, $v_2 <_{\alpha_2} v_3$ and $g_{\alpha_2}(v_1) = A$, $g_{\alpha_2}(v_2) = B$ and $g_{\alpha_2}(v_3) = C$. Both $\alpha_1$ and $\alpha_2$ represent
the same pattern, and they are indistinguishable based on their occurrences, no matter what the
given data sequence is. (Notice that there is no such ambiguity in the $(X_\alpha, R_\alpha)$ representation.)
In order to obtain a unique $(V_\alpha, <_\alpha, g_\alpha)$ representation for $\alpha$, we assume a lexicographic order
over the alphabet $\mathcal{E}$ and ensure that $(g_\alpha(v_1), \ldots, g_\alpha(v_N))$ is ordered as per this ordering. Note
that this lexicographic order on $\mathcal{E}$ is not related in any way to the actual partial order $<_\alpha$. The
lexicographic ordering over $\mathcal{E}$ is only required to ensure a unique representation of injective
episodes in the $(V_\alpha, <_\alpha, g_\alpha)$ notation. Referring to the earlier example involving $\alpha_1$ and $\alpha_2$, we
will use $\alpha_2$ to denote the pattern $(B \to (A\,C))$.

TABLE I
Some example episodes

| Episode $(V_\alpha, <_\alpha, g_\alpha)$ | Graphical notation | $(X_\alpha, R_\alpha)$ |
| --- | --- | --- |
| $V = \{v_1, v_2, v_3\}$; $g(v_1) = A$, $g(v_2) = B$, $g(v_3) = C$; $<_\alpha = \{(v_2, v_1), (v_3, v_1), (v_3, v_2)\}$ | $(C \to B \to A)$ | $X_\alpha = \{A, B, C\}$; $R_\alpha = \{(C, B), (B, A), (C, A)\}$ |
| $V = \{v_1, v_2, v_3\}$; $g(v_1) = A$, $g(v_2) = B$, $g(v_3) = C$; $<_\alpha = \emptyset$ | $(A\,B\,C)$ | $X_\alpha = \{A, B, C\}$; $R_\alpha = \emptyset$ |
| $V = \{v_1, v_2, v_3, v_4\}$; $g(v_1) = A$, $g(v_2) = B$, $g(v_3) = C$, $g(v_4) = D$; $<_\alpha = \{(v_1, v_3), (v_2, v_3), (v_1, v_4), (v_2, v_4)\}$ | $((A\,B) \to (C\,D))$ | $X_\alpha = \{A, B, C, D\}$; $R_\alpha = \{(A, C), (B, C), (A, D), (B, D)\}$ |
| $V = \{v_1, v_2, v_3, v_4, v_5\}$; $g(v_1) = A$, $g(v_2) = B$, $g(v_3) = C$, $g(v_4) = D$, $g(v_5) = E$; $<_\alpha = \{(v_1, v_2), (v_1, v_3), (v_1, v_4), (v_1, v_5), (v_2, v_4), (v_2, v_5)\}$ | $(A \to ((B \to (D\,E))\,C))$ | $X_\alpha = \{A, B, C, D, E\}$; $R_\alpha = \{(A, B), (A, C), (A, D), (A, E), (B, D), (B, E)\}$ |

Finally, note that if $\alpha$ and $\beta$ are injective episodes, and if $\beta \preceq \alpha$ (cf. Definition 4), then the
associated partially ordered sets are related as follows: $X_\beta \subseteq X_\alpha$ and $R_\beta \subseteq R_\alpha$. Some examples of
injective episodes, illustrating the different notations for episodes, are given in Table I.
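For injective episodes in the $(X_\alpha, R_\alpha)$ representation, the subepisode test therefore reduces to two subset checks. The sketch below assumes each episode is stored as a pair of Python sets with $R$ already transitively closed; the name `is_subepisode` is a hypothetical helper.

```python
def is_subepisode(beta, alpha):
    """Subepisode test for injective episodes given as (X, R) pairs.

    X is a set of event types and R is a set of (earlier, later) pairs,
    assumed to be a transitively closed strict partial order.
    """
    x_beta, r_beta = beta
    x_alpha, r_alpha = alpha
    return x_beta <= x_alpha and r_beta <= r_alpha


# The episode (B -> (A C)) and some candidate subepisodes.
alpha = ({"A", "B", "C"}, {("B", "A"), ("B", "C")})

print(is_subepisode(({"A", "B"}, {("B", "A")}), alpha))   # (B -> A)
print(is_subepisode(({"A", "C"}, set()), alpha))          # (A C)
print(is_subepisode(({"A", "C"}, {("A", "C")}), alpha))   # (A -> C)
```

The last call is rejected because $(A \to C)$ imposes an order that $(B \to (A\,C))$ does not.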

III. Finite State Automata for partial orders 

Finite State Automata (FSA) can be used to track occurrences of injective episodes under 
general partial orders in a manner similar to the automata-based algorithms for parallel or serial 
episodes [7], [8], [12]. In this section, we describe the basic construction of such automata. 

We first illustrate the automaton structure through an example. Consider the episode $\alpha = ((A\,B) \to C)$.
Here, $X_\alpha = \{A, B, C\}$ and $R_\alpha = \{(A, C), (B, C)\}$. The FSA used to track occurrences of
this episode is shown in Fig. 1. Each state $i$ is associated with a pair of subsets of $X_\alpha$, namely
$(Q_i^\alpha, W_i^\alpha)$; $Q_i^\alpha \subseteq X_\alpha$ denotes the event-types that the automaton has already accepted by the
time it arrives in state $i$; $W_i^\alpha \subseteq X_\alpha$ denotes the event-types that the automaton in state $i$ is ready
to accept. Initially, the automaton is in state 0, has not accepted any events so far and is waiting
for either of $A$ and $B$, i.e., $Q_0^\alpha = \emptyset$ and $W_0^\alpha = \{A, B\}$. If we see a $B$ first, we accept it and
continue waiting for an $A$, i.e., the automaton transits to state 2 with $Q_2^\alpha = \{B\}$, $W_2^\alpha = \{A\}$.
At this point the automaton is not yet ready to accept a $C$, which happens only after both $A$ and
$B$ are encountered (in whatever order). If, instead of encountering a $B$, the automaton in state 0
first encountered an $A$, then it would transit into state 1 (rather than state 2), where it would
now wait for a $B$ to appear (thus, $Q_1^\alpha = \{A\}$, $W_1^\alpha = \{B\}$). Once both $A$ and $B$ appear in the
data, the automaton will transit, either from state 1 or state 2, into state 3, where it
now waits for a $C$ ($Q_3^\alpha = \{A, B\}$, $W_3^\alpha = \{C\}$). Finally, if the automaton now encounters a $C$
in the data stream, it will transit to the final state, namely, state 4 ($Q_4^\alpha = \{A, B, C\}$, $W_4^\alpha = \emptyset$)
and recognize a full occurrence of the episode $((A\,B) \to C)$.

Fig. 1. Automaton for tracking occurrences of the episode $((A\,B) \to C)$.

In any occurrence of an episode $\alpha$, an event-type $E \in X_\alpha$ can occur only after all its parents in
$R_\alpha$ have been seen. Hence, we initially wait for all those elements of $X_\alpha$ which are minimal
elements of $R_\alpha$. Further, we start waiting for a non-minimal element $E$ of $R_\alpha$ immediately
after all elements less than $E$ in $R_\alpha$ are seen. For each $E \in X_\alpha$, we refer to the subset of
elements in $X_\alpha$ that are less than $E$ (with respect to $R_\alpha$) as the parents of $E$ in episode $\alpha$, and
denote it by $\pi_\alpha(E)$. We now define $\mathcal{A}_\alpha$, the FSA to recognise occurrences of $\alpha$.

Definition 6: The FSA $\mathcal{A}_\alpha$, used to track occurrences of $\alpha$ in the data stream, is defined as follows.
Each state $i$ in $\mathcal{A}_\alpha$ is represented by a unique pair of subsets of $X_\alpha$, namely $(Q_i^\alpha, W_i^\alpha)$;
$Q_i^\alpha \subseteq X_\alpha$ is the set of event-types that the automaton has accepted so far and $W_i^\alpha$ is the set
of event-types that the automaton is currently ready to accept. The initial state, namely, state 0,
is associated with the subsets pair $(Q_0^\alpha, W_0^\alpha)$, where $Q_0^\alpha = \emptyset$ and $W_0^\alpha$ is the collection of least
elements in $X_\alpha$ with respect to $R_\alpha$. Let $i$ be the current state of $\mathcal{A}_\alpha$ and let the next event in the
data be of type $E \in \mathcal{E}$. $\mathcal{A}_\alpha$ remains in state $i$ if $E \notin W_i^\alpha$. If $E \in W_i^\alpha$, then $\mathcal{A}_\alpha$ accepts
$E$ and transits into state $j$, with:

$$Q_j^\alpha = Q_i^\alpha \cup \{E\} \qquad (2)$$
$$W_j^\alpha = \{E' \in (X_\alpha \setminus Q_j^\alpha) : \pi_\alpha(E') \subseteq Q_j^\alpha\} \qquad (3)$$

When $Q_j^\alpha = X_\alpha$ (and hence $W_j^\alpha = \emptyset$), $j$ is the final state of $\mathcal{A}_\alpha$.
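The transition rule of Definition 6 can be sketched directly from Eqs. (2) and (3). The code below is an illustrative sketch, not the paper's implementation: the names `parents`, `waiting_set` and `run_automaton` are assumptions, and $R_\alpha$ is assumed to be given as a transitively closed set of (earlier, later) pairs.

```python
def parents(R, e):
    """pi_alpha(e): all event types that precede e under R."""
    return {a for (a, b) in R if b == e}


def waiting_set(X, R, Q):
    """Eq. (3): unaccepted event types whose parents are all accepted."""
    return {e for e in X - Q if parents(R, e) <= Q}


def run_automaton(X, R, event_types):
    """Feed a sequence of event types to the automaton, starting in state 0.

    Returns the list of visited states as (Q, W) pairs of frozensets.
    """
    Q = set()
    trace = [(frozenset(Q), frozenset(waiting_set(X, R, Q)))]
    for e in event_types:
        if e in waiting_set(X, R, Q):
            Q = Q | {e}                      # Eq. (2)
            trace.append((frozenset(Q), frozenset(waiting_set(X, R, Q))))
        # Otherwise the automaton stays in its current state.
    return trace


# The episode ((A B) -> C) of Fig. 1, fed the event types B, C, A, C.
X = {"A", "B", "C"}
R = {("A", "C"), ("B", "C")}
for Q, W in run_automaton(X, R, "BCAC"):
    print(sorted(Q), sorted(W))
```

On this input the first $C$ is ignored because $A$ has not yet been accepted, and the automaton passes through states 0, 2, 3 and 4 of Fig. 1.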

It may be noted that not all possible tuples $(Q, W)$, where $Q \subseteq X_\alpha$ and $W \subseteq X_\alpha$, constitute
valid states of the automaton. For example, in Fig. 1 there can be no valid state corresponding
to $Q = \{A, C\}$ (since $C$ could not have been accepted without $B$ being accepted before it). We
list below a few properties of the valid states of the automaton. All these are easily proved from
the above definition.

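Property 1: For any state $j$ of the automaton $\mathcal{A}_\alpha$, the set $W_j^\alpha$ of event-types that $\mathcal{A}_\alpha$ will
wait for in state $j$ (as per Eq. (3) in Definition 6) is exactly the set of least elements of $(X_\alpha \setminus Q_j^\alpha)$
(with respect to the partial order $R_\alpha$).

Proof: If $E$ is a least element of $(X_\alpha \setminus Q_j^\alpha)$, it implies that all parents of $E$ (if any) are
outside $(X_\alpha \setminus Q_j^\alpha)$. Hence, they must have already been accepted by $\mathcal{A}_\alpha$ (i.e. we must have
$\pi_\alpha(E) \subseteq Q_j^\alpha$), and so, by (3), we have $E \in W_j^\alpha$. Conversely, every $E'$ that $\mathcal{A}_\alpha$ is waiting for
(according to (3)) trivially belongs to $(X_\alpha \setminus Q_j^\alpha)$, and since (3) also prescribes that $\pi_\alpha(E') \subseteq Q_j^\alpha$,
no parent of $E'$ lies in $(X_\alpha \setminus Q_j^\alpha)$ (in particular, $\pi_\alpha(E') \cap W_j^\alpha = \emptyset$). Hence, such an $E'$ must be a least element of $(X_\alpha \setminus Q_j^\alpha)$. ■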

Property 2: For any state $j$ of the automaton $\mathcal{A}_\alpha$, if $(X_\alpha \setminus Q_j^\alpha)$ is non-empty, then $W_j^\alpha$ is
non-empty. Thus, the only state out of which $\mathcal{A}_\alpha$ makes no state transitions no matter what the input
sequence (i.e. the only final state of $\mathcal{A}_\alpha$) is the one represented by the pair $(X_\alpha, \emptyset)$.

Proof: If $(X_\alpha \setminus Q_j^\alpha)$ is non-empty, then it must contain at least one least element (with
respect to $R_\alpha$) and from Property 1 this element must be in $W_j^\alpha$ (and hence, $W_j^\alpha$ must be
non-empty). ■
Property 3: Given the set $W_j^\alpha$ of event-types that $\mathcal{A}_\alpha$ will wait for in state $j$, $j \ge 0$, the
corresponding set of event-types accepted by the time $\mathcal{A}_\alpha$ reaches state $j$ is given by

$$Q_j^\alpha = \{E \in (X_\alpha \setminus W_j^\alpha) \text{ s.t. } \pi_\alpha(E) \cap W_j^\alpha = \emptyset\} \qquad (4)$$

Thus, for any two distinct states $i$ and $j$ of $\mathcal{A}_\alpha$, we must have both $Q_i^\alpha \ne Q_j^\alpha$ and $W_i^\alpha \ne W_j^\alpha$.

Proof: If the automaton $\mathcal{A}_\alpha$ has accepted an event of type $E$ (i.e. if $E \in Q_j^\alpha$ as per
Definition 6) then all parents of $E$ (if any) should have been previously accepted by $\mathcal{A}_\alpha$, and
hence, we must have $E \in (X_\alpha \setminus W_j^\alpha)$ and $\pi_\alpha(E) \cap W_j^\alpha = \emptyset$. To show the other way, consider
an $E \in (X_\alpha \setminus W_j^\alpha)$ such that $\pi_\alpha(E) \cap W_j^\alpha = \emptyset$. Now, if $E \notin Q_j^\alpha$ (i.e. if $E$ has not yet been
accepted by $\mathcal{A}_\alpha$ as per Definition 6) then $\mathcal{A}_\alpha$ must wait for either $E$ or one (or more) of its
parents, i.e. either $E \in W_j^\alpha$ or $\pi_\alpha(E) \cap W_j^\alpha \ne \emptyset$ (which contradicts our original assumption for
$E$). This completes the proof of Property 3. ■
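Eq. (4) says that the waiting set alone determines the accepted set. As a quick sanity check, the sketch below recovers $Q$ from $W$ for the five states of the automaton in Fig. 1; the function name `accepted_from_waiting` and the hand-listed `FIG1_STATES` are assumptions made for this illustration.

```python
def parents(R, e):
    """pi_alpha(e): all event types that precede e under R."""
    return {a for (a, b) in R if b == e}


def accepted_from_waiting(X, R, W):
    """Eq. (4): event types outside W none of whose parents are in W."""
    return {e for e in X - W if not (parents(R, e) & W)}


# The episode ((A B) -> C) and the (Q, W) pairs of states 0..4 in Fig. 1.
X = {"A", "B", "C"}
R = {("A", "C"), ("B", "C")}
FIG1_STATES = [
    (set(), {"A", "B"}),
    ({"A"}, {"B"}),
    ({"B"}, {"A"}),
    ({"A", "B"}, {"C"}),
    ({"A", "B", "C"}, set()),
]

for Q, W in FIG1_STATES:
    print(sorted(W), "->", sorted(accepted_from_waiting(X, R, W)))
```

Since the map from $W$ back to $Q$ is a function, two states with the same waiting set must also have the same accepted set, which is the uniqueness claim of Property 3.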

The next two properties give an exact characterization of $Q^\alpha$ and $W^\alpha$ for an episode $\alpha$. They
describe the kind of subsets (of $X_\alpha$) that actually come up as $Q^\alpha$'s and $W^\alpha$'s in $\mathcal{A}_\alpha$.

Property 4: Let $\mathcal{A}_\alpha$ denote an automaton of episode $\alpha$, as per construction. Given $Q^\alpha \subseteq X_\alpha$,
$Q^\alpha$ is the set of event-types that $\mathcal{A}_\alpha$ has currently accepted $\iff$ $\forall E \in Q^\alpha$, $\pi_\alpha(E) \subseteq Q^\alpha$.

Proof: Since $Q_0^\alpha$ was initially empty, for $E \in Q^\alpha$, we must have had $E \in W_i^\alpha$ for some
other (earlier) state $i$. Now if $E \in W_i^\alpha$, then either (i) $\pi_\alpha(E) = \emptyset$, in which case $E$ must be a least
element of $X_\alpha$ with respect to $R_\alpha$, or (ii) $\pi_\alpha(E)$ is non-empty, so that, by applying (3) for state
$i$, we know $E$ must have been added to $W_i^\alpha$ only if $\pi_\alpha(E) \subseteq Q_i^\alpha$. But, from (2), we know
$Q_i^\alpha \subseteq Q^\alpha$. This implies $\pi_\alpha(E) \subseteq Q^\alpha$.

Conversely, suppose $Q^\alpha$ is such that $\forall E \in Q^\alpha$, $\pi_\alpha(E) \subseteq Q^\alpha$. Consider a least element $E_1$
in $Q^\alpha$ (with respect to $R_\alpha$). $E_1$ has no parents in $Q^\alpha$. By definition of $Q^\alpha$, $E_1$ has no parents
outside $Q^\alpha$. Hence, $E_1$ is also a least element of $X_\alpha$, which implies $E_1 \in W_0^\alpha$. Hence from state
0, $\mathcal{A}_\alpha$ makes a transition (on seeing $E_1$) to a state 1 with $Q_1^\alpha = \{E_1\}$. Now consider a least element
$E_2$ in $Q^\alpha \setminus Q_1^\alpha$. One can verify on similar lines that $\pi_\alpha(E_2) \subseteq Q_1^\alpha$. This means (from (3)) that
$E_2 \in W_1^\alpha$. Hence, $\mathcal{A}_\alpha$ makes a transition (on seeing $E_2$) to a state 2 with $Q_2^\alpha = Q_1^\alpha \cup \{E_2\}$. This
process continues till a stage $k = |Q^\alpha|$ at which $\mathcal{A}_\alpha$ actually enters a state $k$ where $Q_k^\alpha = Q^\alpha$.

■
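Property 4 can be checked by brute force on small episodes: the states reachable from state 0 should be exactly the subsets of $X_\alpha$ that are closed under taking parents. The sketch below is only an illustration under that reading; `reachable_states` and `parent_closed_subsets` are hypothetical helper names and $R_\alpha$ is assumed to be transitively closed.

```python
from itertools import combinations


def parents(R, e):
    """pi_alpha(e): all event types that precede e under R."""
    return {a for (a, b) in R if b == e}


def waiting_set(X, R, Q):
    """Eq. (3): unaccepted event types whose parents are all accepted."""
    return {e for e in X - Q if parents(R, e) <= Q}


def reachable_states(X, R):
    """Accepted sets Q of all states reachable from state 0."""
    start = frozenset()
    seen = {start}
    frontier = [start]
    while frontier:
        Q = frontier.pop()
        for e in waiting_set(X, R, Q):
            nxt = Q | {e}
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def parent_closed_subsets(X, R):
    """All Q within X such that every E in Q has pi_alpha(E) inside Q."""
    items = sorted(X)
    closed = set()
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            Q = frozenset(combo)
            if all(parents(R, e) <= Q for e in Q):
                closed.add(Q)
    return closed


# The episode ((A B) -> C) of Fig. 1.
X = {"A", "B", "C"}
R = {("A", "C"), ("B", "C")}
print(len(reachable_states(X, R)))
print(reachable_states(X, R) == parent_closed_subsets(X, R))
```

For the episode of Fig. 1 both enumerations yield the same five sets, and $\{A, C\}$ is correctly absent from each.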

Property 5: Let $\mathcal{A}_\alpha$ denote an automaton of episode $\alpha$, as per construction. Given $W^\alpha \subseteq X_\alpha$,
$W^\alpha$ is the set of event-types that $\mathcal{A}_\alpha$ is currently waiting for $\iff$ $\forall E \in W^\alpha$, $\pi_\alpha(E) \cap W^\alpha = \emptyset$.

Proof: The forward direction is straightforward from (3). Conversely, consider a $W^\alpha$ such
that $\forall E \in W^\alpha$, $\pi_\alpha(E) \cap W^\alpha = \emptyset$. Consider the following set.

$$Q^\alpha = \{E \in (X_\alpha \setminus W^\alpha) : \pi_\alpha(E) \cap W^\alpha = \emptyset\} \qquad (5)$$

Note that this set is exactly similar to the one defined in Property 3. We will first show that
this $Q^\alpha$ is such that $\forall E \in Q^\alpha$, $\pi_\alpha(E) \subseteq Q^\alpha$. If this is true, then from Property 4, $Q^\alpha$ is a set of
events that $\mathcal{A}_\alpha$ would have accepted at some stage. We next show that the $W^\alpha$ that we started
off with is the set of event-types $\mathcal{A}_\alpha$ would wait for, after having accepted the set of events $Q^\alpha$.

Consider an $E \in Q^\alpha$. Let $E'$ be a parent of $E$. We need to show that $E' \in Q^\alpha$, i.e. $E' \in
(X_\alpha \setminus W^\alpha)$ and $\pi_\alpha(E') \cap W^\alpha = \emptyset$. If $E' \notin (X_\alpha \setminus W^\alpha)$, then a parent of $E$ is in $W^\alpha$, which
contradicts $E \in Q^\alpha$. If $\pi_\alpha(E') \cap W^\alpha \ne \emptyset$, then there exists an $E'' \in W^\alpha$ such that $(E'', E') \in R_\alpha$. But
we also have $(E', E) \in R_\alpha$. Hence, by transitivity, $E''$ is a parent of $E$ in $W^\alpha$, which contradicts
$E \in Q^\alpha$.

We now show that $W^\alpha$ is indeed the set of least elements in $X_\alpha \setminus Q^\alpha$. Every element in
$Y = X_\alpha \setminus (W^\alpha \cup Q^\alpha)$ should have a parent in $W^\alpha$. Otherwise, it must be in $Q^\alpha$. So no element
in $Y$ is a least element of $W^\alpha \cup Y = X_\alpha \setminus Q^\alpha$. Consider an $E \in W^\alpha$. We need to show that
$\forall E' \in W^\alpha \cup Y$, $E'$ is not less than $E$. Since no two elements of $W^\alpha$ are related, no $E'$ from
$W^\alpha$ can be less than $E$. Suppose there exists an $E' \in Y$ such that $(E', E) \in R_\alpha$. Since $E' \in Y$,
it has a parent in $W^\alpha$. By transitivity, $E \in W^\alpha$ has a parent in $W^\alpha$, which contradicts the
definition of $W^\alpha$. Hence we have shown that $W^\alpha$ would be the set of events $\mathcal{A}_\alpha$ would wait
for, after having accepted the set of events $Q^\alpha$. ■

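Property 6: Consider two states $i$ and $j$ of $\mathcal{A}_\alpha$ with sets of accepted event-types $Q_i^\alpha$ and $Q_j^\alpha$
such that $Q_i^\alpha \subseteq Q_j^\alpha$. Let $k = |Q_j^\alpha \setminus Q_i^\alpha|$. There exists a sequence of events, $(E_1, \ldots, E_k)$, on
which $\mathcal{A}_\alpha$, currently in state $i$, will make $k$ state transitions, eventually arriving at state $j$ with
the set of accepted events given by $Q_j^\alpha$.

Proof: The proof is very similar to the converse argument in Property 4. For the sake
of completeness, we give the entire argument. Let $E_1 \in (Q_j^\alpha \setminus Q_i^\alpha)$ be such that $E_1$ is a least
element of $(Q_j^\alpha \setminus Q_i^\alpha)$ (with respect to $R_\alpha$). From Property 4 we know $\pi_\alpha(E_1)$ must belong to
$Q_j^\alpha$ ($\supseteq Q_i^\alpha$). Since $E_1$ is (by definition) a least element of $(Q_j^\alpha \setminus Q_i^\alpha)$, none of $E_1$'s parents are
in $(Q_j^\alpha \setminus Q_i^\alpha)$. So we must have $\pi_\alpha(E_1) \subseteq Q_i^\alpha$. This will ensure (from (3)) that $E_1 \in W_i^\alpha$, and
so, $\mathcal{A}_\alpha$ in state $i$, on seeing $E_1$, will make a transition to (say) state $i_1$, with $Q_{i_1}^\alpha = Q_i^\alpha \cup \{E_1\}$
and $W_{i_1}^\alpha = \{E' \in (X_\alpha \setminus Q_{i_1}^\alpha) \text{ s.t. } \pi_\alpha(E') \subseteq Q_{i_1}^\alpha\}$. Next we consider $E_2$, a least element in
$(Q_j^\alpha \setminus Q_{i_1}^\alpha)$, and repeating the same argument as for $E_1$ above, we can see that $\mathcal{A}_\alpha$ will now
transit into state $i_2$, with $Q_{i_2}^\alpha = Q_{i_1}^\alpha \cup \{E_2\}$ and $W_{i_2}^\alpha = \{E' \in (X_\alpha \setminus Q_{i_2}^\alpha) \text{ s.t. } \pi_\alpha(E') \subseteq Q_{i_2}^\alpha\}$.
Thus, for $l = 2, \ldots, k$, we can construct $Q_{i_l}^\alpha$ by adding a least element of $(Q_j^\alpha \setminus Q_{i_{l-1}}^\alpha)$, $E_l$, to
$Q_{i_{l-1}}^\alpha$. At each step, the number of accepted elements increments by 1, so that after accepting
$k$ events in this manner, $\mathcal{A}_\alpha$ will arrive in state $i_k$ with $Q_{i_k}^\alpha = Q_j^\alpha$ (since $|Q_{i_k}^\alpha| = |Q_j^\alpha|$ and
$Q_{i_k}^\alpha \subseteq Q_j^\alpha$).

■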

IV. Counting Algorithms 

The data mining task in the frequent episode paradigm is to extract all episodes whose
frequency exceeds a user-defined threshold. Like current algorithms for frequent serial/parallel
episode discovery [7], [12], we use an Apriori-style level-wise procedure for mining frequent
episodes with general partial orders. Each level has two steps, namely, candidate generation and
frequency counting. At level $l$, the candidate generation step combines frequent episodes of size
$(l - 1)$ to construct candidates of size $l$. It exploits the simple but powerful fact that if a pattern
is frequent then (certain kinds of) its subpatterns are also frequent. The frequency counting step
computes frequencies of all episodes in the candidate set and returns the set of frequent $l$-size
episodes. Sec. V provides a detailed explanation of the candidate generation. In this section, we
present an algorithm for obtaining the frequencies (or counting the number of non-overlapped
occurrences) of a set of general injective episodes of a given size.

For counting the number of non-overlapped occurrences of a set of serial episodes, [8] proposed
an algorithm using only one automaton per episode. This algorithm can be generalized, as
explained below, to count non-overlapped occurrences of a set of injective episodes (with general
partial orders) by using the more general FSA of Definition 6. We initialize the automaton
(associated with the episode) in its start state. The automaton makes transitions as prescribed
by Definition 6. We traverse the data stream and let the automaton transit to its next state as
soon as a relevant event-type appears in the data stream. When the automaton reaches its final
state, we increment the frequency of the episode and reset the automaton to its start state so that
it can track the next occurrence. Since we have a set of candidates, we have one such
automaton for each episode. For each event in the data stream, we look at all automata waiting
for that event and effect the appropriate state transitions for all of them. Such an algorithm
counts the non-overlapped occurrences of all candidate episodes through one pass over the data.
Consider this counting scheme for the episode $\beta = ((A\,B) \to (C\,D))$ in the following data stream:

$$((A, 1), (B, 2), (A, 3), (D, 4), (E, 5), (C, 6), (D, 7), (A, 8), (B, 9), (B, 10), (C, 12), (D, 14)) \qquad (6)$$

The above method tracks occurrences $h_1 = ((A, 1), (B, 2), (D, 4), (C, 6))$ and $h_2 = ((A, 8), (B, 9), (C, 12), (D, 14))$
and returns a frequency of 2 for this episode in this data stream.
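The single-automaton scheme described above can be sketched for one episode as follows. This is an illustrative sketch rather than the paper's pseudocode: the names `parents` and `count_non_overlapped` are assumptions, $R$ is assumed transitively closed, and the event times are assumed distinct (as they are in sequence (6)), so that accepting events in stream order respects the strict time order.

```python
def parents(R, e):
    """pi_alpha(e): all event types that precede e under R."""
    return {a for (a, b) in R if b == e}


def count_non_overlapped(events, X, R):
    """Count non-overlapped occurrences with one automaton, no expiry time."""
    Q = set()          # event types accepted so far (start state: empty)
    count = 0
    for etype, _time in events:
        # The automaton waits for etype iff it is unaccepted and all of its
        # parents have been accepted (Eq. (3)).
        if etype in X and etype not in Q and parents(R, etype) <= Q:
            Q.add(etype)
            if Q == X:             # final state reached
                count += 1
                Q = set()          # reset to the start state
    return count


# Sequence (6) and the episode beta = ((A B) -> (C D)).
SEQ6 = [("A", 1), ("B", 2), ("A", 3), ("D", 4), ("E", 5), ("C", 6),
        ("D", 7), ("A", 8), ("B", 9), ("B", 10), ("C", 12), ("D", 14)]
X_BETA = {"A", "B", "C", "D"}
R_BETA = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}

print(count_non_overlapped(SEQ6, X_BETA, R_BETA))
```

Running the sketch on sequence (6) reproduces the frequency of 2 obtained from the occurrences $h_1$ and $h_2$.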

Though this algorithm is efficient (as it uses only one automaton per episode), it cannot
implement any temporal constraints on occurrences of episodes. One constraint that is often
useful in applications is the expiry time constraint, which is stated in terms of an upper bound
on the span of an occurrence. The span of an occurrence is the largest difference between the
times associated with any two events in the occurrence. Under the expiry time constraint, the
frequency of an episode is the maximum number of non-overlapped occurrences such that the span
of each occurrence does not exceed a user-defined threshold. (The window-width of [12] essentially
implements a similar constraint.) An expiry time constraint is often useful because an occurrence
of a pattern constituted by events widely separated in time may not really indicate any underlying
causative influences. Consider counting occurrences of $\beta$ in sequence (6) with an expiry time
constraint of 4. The occurrences $h_1$ and $h_2$ of $\beta$ (that the algorithm would track as specified
earlier) have spans 5 and 6 respectively. Hence our algorithm can only assign a frequency of zero
under the expiry time constraint. However, the occurrence $h_3 = ((B, 2), (A, 3), (D, 4), (C, 6))$ of $\beta$
in sequence (6) has a span of 4 (satisfying the constraint). The reason why our algorithm cannot
track $h_3$ is that the automaton makes a state transition as soon as the relevant event-type appears
in the data and thus it accepts the event $(A, 1)$. We can count non-overlapped occurrences under
an expiry time constraint if we allow more than one automaton per episode, as explained below.

Consider this example again with a modified algorithm as follows. As earlier, our automaton
will accept $(A, 1)$ and transit into a state that waits for a $B$. Now, since the automaton moved out
of the start state, we immediately spawn another automaton for this episode, which is initialized
in the start state. Now when we encounter $(B, 2)$ the first automaton will accept it and move
into a state where it waits for a $C$ or a $D$. The second automaton, which is in the start state, will
also accept $(B, 2)$ and move to a state where it waits for an $A$. (We would now initialize a third
automaton for this episode because the second one moved out of the start state.) Now when we
encounter $(A, 3)$ the second automaton would accept it and move into a state of waiting for a
$C$ or a $D$, which is the same state as the first one is in. From now on, both automata would
make identical transitions and hence we can retire the first automaton. This is because the
second automaton was initialized later and hence the span of the occurrence tracked by it would
be smaller. (The third automaton would also accept $(A, 3)$ and we will spawn a new fourth
automaton in the start state.) Now when we encounter $(D, 4)$ and later $(C, 6)$, the second automaton
would reach the final state. Since the occurrence tracked by this automaton satisfies the expiry
time constraint, we can now increment the frequency and then retire all other automata of this
episode. We will also spawn a fresh automaton for this episode in the start state so that we can
begin to track the next non-overlapped occurrence (if any) of this episode.

We can now specify the general method for counting under the expiry time constraint as follows.
Instead of spawning a new automaton only after the existing one reaches its final state, we spawn
a new automaton whenever an existing automaton accepts its first event (i.e., when it transits out
of its start state). Each of the automata makes a state transition as soon as a relevant event-type
appears in the data stream. When counting like this, it is possible for two automata to reach the
same state. In such cases, we drop the older one (retaining only the most recent automaton). This
strategy tracks, in a sense, the innermost occurrence amongst a set of overlapping occurrences
that end together. When any automaton (of an episode) reaches the final state, we check whether
the span of the occurrence tracked by this automaton satisfies the expiry time constraint. If it
does, we increment the frequency and retire all the automata of that episode except for one
automaton in the start state to track the next occurrence. If the span of the occurrence tracked
by the automaton that reached the final state does not satisfy the expiry time constraint, then
we only retire that automaton. This is the algorithm that we use for counting the frequencies of
episodes.
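The multi-automaton strategy just described can be sketched for a single episode as below. This is a simplified illustration, not Algorithm 1: the names `count_with_expiry`, `waiting_set` and `parents` are assumptions, each active automaton is stored as a map from its accepted set $Q$ to the time of its first accepted event, the start-state automaton is kept implicit, a span equal to the expiry time is treated as satisfying the constraint (as in the worked example with $h_3$), and event times are assumed distinct.

```python
def parents(R, e):
    """pi_alpha(e): all event types that precede e under R."""
    return {a for (a, b) in R if b == e}


def waiting_set(X, R, Q):
    """Eq. (3): unaccepted event types whose parents are all accepted."""
    return {e for e in X - Q if parents(R, e) <= Q}


def count_with_expiry(events, X, R, expiry):
    """Count non-overlapped occurrences whose span is at most `expiry`."""
    X = frozenset(X)
    active = {}      # accepted set Q -> time of the first accepted event
    count = 0
    for etype, t in events:
        updated = {}
        for Q, t0 in active.items():
            Q2 = Q | {etype} if etype in waiting_set(X, R, Q) else Q
            # Two automata in the same state: keep the more recent one.
            if Q2 not in updated or updated[Q2] < t0:
                updated[Q2] = t0
        # The implicit start-state automaton accepts its first event, and a
        # fresh start-state automaton takes its place.
        if etype in waiting_set(X, R, frozenset()):
            updated[frozenset({etype})] = t
        if X in updated:                     # some automaton is in the final state
            t0 = updated.pop(X)              # retire that automaton
            if t - t0 <= expiry:
                count += 1
                updated = {}                 # retire all other automata too
        active = updated
    return count


# Sequence (6) and the episode beta = ((A B) -> (C D)).
SEQ6 = [("A", 1), ("B", 2), ("A", 3), ("D", 4), ("E", 5), ("C", 6),
        ("D", 7), ("A", 8), ("B", 9), ("B", 10), ("C", 12), ("D", 14)]
X_BETA = {"A", "B", "C", "D"}
R_BETA = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}

print(count_with_expiry(SEQ6, X_BETA, R_BETA, 4))
```

With an expiry time of 4 the sketch counts the occurrence $h_3$ and rejects the later occurrence of span 6, giving a frequency of 1; storing automata in a dictionary keyed by state makes the "drop the older automaton" rule a simple overwrite.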

The pseudocode for counting non-overlapped occurrences of general injective episodes with
an expiry-time constraint is given in Algorithm 1. The inputs to the algorithm are: $\mathcal{C}_l$, the set
of $l$-node candidate episodes, $D$, the event stream, $\mathcal{E}$, the set of event-types, $\gamma$, the frequency
threshold, and $T_X$, the expiry-time. The algorithm outputs the set, $\mathcal{F}_l$, of frequent episodes. The
event-types associated with an $l$-node episode $\alpha$ are stored in the $\alpha.g[\,]$ array: for $i = 1, \ldots, l$,
$\alpha.g[i]$ is assigned the value $g_\alpha(v_i)$. We store the partial order $<_\alpha$ associated with the episode
as a binary adjacency matrix, $\alpha.e[\,][\,]$. The notation is: $\alpha.e[i][j] = 1$ iff $v_i <_\alpha v_j$ (or equivalently,
if $(\alpha.g[i], \alpha.g[j]) \in R_\alpha$).

The main data structure is an array of lists, $waits()$, indexed by the set of event-types. The
elements of each list store the relevant information about all the automata that are waiting for
a particular event-type (and hence can make a state transition if that event-type appears in the
data stream). The entries in the list are of the form $(\alpha, q, w, j)$ where $\alpha$ is a candidate episode,
$(q, w)$ is one of the possible states of the automaton associated with $\alpha$ (cf. Definition 6) and $j$
is an integer. For an event-type $E$, if $(\alpha, q, w, j) \in waits(E)$, it denotes that an automaton of
the episode $\alpha$ (with $\alpha.g[j] = E$) is currently in state $(q, w)$ and is waiting for an event-type $E$
to make a state transition. Recall from Definition 6 that each state of the automaton is specified
by a pair of subsets, $(Q^\alpha, W^\alpha)$, of the set of event-types $X^\alpha$ of $\alpha$. In our representation, $q$
and $w$ are $|X^\alpha|$-length binary vectors encoding the two sets $(Q^\alpha, W^\alpha)$. Consider the earlier
example episode $\beta = (AB) \rightarrow (CD)$. For this, we have $X^\beta = \{A, B, C, D\}$. Suppose this
automaton has already accepted an $A$ and a $B$ and is waiting for a $C$ or $D$. So, its current state is
$(\{A, B\}, \{C, D\})$. This automaton would be listed both in $waits(C)$ and $waits(D)$. We would
have $(\beta, q, w, 3) \in waits(C)$ and $(\beta, q, w, 4) \in waits(D)$ where $q = [1\ 1\ 0\ 0]$ and $w = [0\ 0\ 1\ 1]$.
Thus, in general, for an automaton in state $(Q^\alpha, W^\alpha)$, there would be $|W^\alpha|$ tuples in the different
$waits()$ lists with the tuples differing only in the fourth position. As we traverse the data, if the
next event is of event-type $E$, then we access all the automata waiting for $E$ through $waits(E)$
and effect state transitions. Knowing the current state of the automaton, we can compute its
next state after accepting $E$ because we have the partial order of the episode stored in the $\alpha.e$ array.
Since, as explained above, an automaton can be listed in multiple $waits()$ lists (because it can be
waiting for a set of event-types), we have to ensure that the state transition is properly reflected
in all $waits()$ lists.
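
As an illustration of this bookkeeping, the following Python sketch builds the $waits()$ entries for the state of $\beta = (AB) \rightarrow (CD)$ discussed above. The function name `waits_entries` and the use of a dictionary of lists are assumptions made for this example; they are not taken from the paper's implementation.

```python
from collections import defaultdict

def waits_entries(episode, g, q, w):
    """Return (event_type, tuple) pairs, one per event type the automaton awaits.

    g is the array of event types of the episode; q and w are the binary
    vectors of accepted and awaited event types. The fourth tuple position
    is 1-based, as in the text.
    """
    return [(g[i], (episode, tuple(q), tuple(w), i + 1))
            for i, bit in enumerate(w) if bit == 1]

# State ({A, B}, {C, D}) of the episode beta = (AB) -> (CD).
waits = defaultdict(list)
for event_type, entry in waits_entries("beta", ["A", "B", "C", "D"],
                                       [1, 1, 0, 0], [0, 0, 1, 1]):
    waits[event_type].append(entry)
```

The automaton ends up listed once under C (position 3) and once under D (position 4), with the two tuples differing only in the fourth position.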

In addition to the $\alpha.g$ and $\alpha.e$ arrays, the other pieces of information that we store with an
episode $\alpha$ are: $\alpha.freq$, $\alpha.init$ and $\alpha.w_{start}$. The frequency of an episode is tracked in
$\alpha.freq$. For each episode $\alpha$, $\alpha.init$ is a list that keeps track of the times at which the various
currently active automata of $\alpha$ made their transition out of the start state. Each entry in this list
is a pair $(q, t)$, indicating that an automaton initialized (i.e., made its first state transition) at
time $t$ is currently in a state with the set of accepted events represented by $q$. Since a start-state
automaton is yet to make its first transition, there is no corresponding entry for it in $\alpha.init$.
The information in $\alpha.init$ is necessary to properly take care of situations where an automaton
transits into a state already occupied by another automaton. It is also used to check that the
span of an occurrence satisfies the expiry time constraint before incrementing the frequency. $\alpha.w_{start}$
is a $|X^\alpha|$-length binary vector encoding the set $W^\alpha_0$, the set of all least elements of $X^\alpha$ (with
respect to $R^\alpha$). In other words, it encodes the set of all event-types for which an automaton for
$\alpha$ would wait in its start state. Since this information is needed every time an automaton for
the episode is to be initialized, it is useful to precompute it.
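
The precomputation of the start-state waiting vector can be sketched as follows. This is only an illustration under the adjacency-matrix convention used in the text ($e[j][i] = 1$ iff node $j$ precedes node $i$), with 0-based indices; the name `start_wait_vector` is hypothetical.

```python
def start_wait_vector(e):
    """Mark the least elements of the partial order stored in e.

    e[j][i] == 1 iff node j precedes node i (0-based indices). A node is a
    least element when no other node precedes it, i.e. its column is all zeros.
    """
    n = len(e)
    return [1 if all(e[j][i] == 0 for j in range(n)) else 0 for i in range(n)]

# Episode (AB) -> (CD): A and B each precede both C and D.
e_beta = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
w_start = start_wait_vector(e_beta)
```

For this episode the automaton waits for A and B in its start state, so only the first two positions of `w_start` are set.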

We now explain the working of Algorithm 1 by referring to the line numbers in the pseudocode.
Lines 4-12 initialize all the $waits()$ lists by having one automaton for each candidate episode,
waiting in its start state. In the main data pass loop (lines 15-62), we look at each item $(E_k, t_k)$,
$k = 1, 2, \ldots, n$, in the event stream and modify the $waits()$ lists to effect state transitions of all
automata waiting for $E_k$. This is done by accessing and processing each tuple in the $waits(E_k)$ list,
in the loop starting on line 17. This is the main computation in the algorithm
and we explain it below. For a tuple $(\alpha, q_{cur}, w_{cur}, j) \in waits(E_k)$, we need to effect a state
transition (since we have seen $E_k$). The next state information for this automaton is denoted as
$q_{nxt}$, $w_{nxt}$ in the pseudocode. We compute $q_{nxt}$ by setting the $j^{th}$ bit to one (line 20). Recall that
in the start state we will have $q = \mathbf{0}$ (vector of all zeros). Hence if $q_{cur} = \mathbf{0}$, it means that
this automaton is making its first transition out of the start state and hence we add $(q_{nxt}, t_k)$
to the $\alpha.init$ list in line 22. (Recall that $\alpha.init$ contains all active automata for episode $\alpha$ and for
each automaton we record its current state and the time at which it made its transition out of
the start state). Also, when $q_{cur} = \mathbf{0}$, since this automaton is now moving out of its start state,
we need a new automaton for $\alpha$ initialized in its start state. We do this by remembering $\alpha$ in
a temporary memory called $bag$. (While processing all tuples in $waits(E_k)$, we accumulate in $bag$
all episodes for which new automata are to be initialized in the start state.
Then, after processing all tuples in $waits(E_k)$, we initialize all these automata
in lines 58-62). The final state of the automaton corresponds to $q$ becoming $\mathbf{1}$, a vector of
all ones. If $q_{nxt} \neq \mathbf{1}$, then this automaton, after the current state transition, is still an active
automaton for $\alpha$ and hence we need to update the $\alpha.init$ list to reflect the new state of this
automaton, which is done in lines 25-28. When $q_{nxt} \neq \mathbf{1}$, to complete the computation of its
next state, we need to find $w_{nxt}$. This automaton has now accepted its $j^{th}$ event type. Hence,
using the partial order information contained in the $\alpha.e$ array, we need to find which new event
types it is ready to accept. Using this, we can compute $w_{nxt}$ as done in lines 31 to 37. It is
computed based on the children of $E_k$ in $R^\alpha$ as follows. It is easy to verify based on Definition 6
that $W^\alpha_{nxt} = (W^\alpha_{cur} \setminus \{E_k\}) \cup W'$, where

$W' = \{\text{children } E' \text{ of } E_k \text{ in } R^\alpha : \pi_\alpha(E') \subseteq Q^\alpha_{nxt}\}$

We then need to put this automaton in the $waits()$ list of all those event types that it can accept
now. Also, we should modify state information in the $waits()$ lists of event types corresponding
to its previous state. This is done in lines 39-43. We point out that in this process, the $waits()$
lists would end up having duplicate elements if there is already an automaton in the state $q_{nxt}$.
If after the current state transition, the automaton came into a state in which there is another
automaton of this episode, then we have to remove the older automaton. Presence of an older
automaton is indicated by an entry $(q_{nxt}, t')$ for some $t'$ in the $\alpha.init$ list. If $t' < t_{cur}$, where $t_{cur}$
is the time when the current automaton made its first state transition, then we need to remove
the older automaton, which is done in lines 45-47. We would also need to remove one of the
duplicate elements in the appropriate $waits()$ lists as indicated in lines 48-50 (see footnote 2). If $q_{nxt} = \mathbf{1}$
(so that we have now reached the final state), then we need to check whether the span of the
occurrence tracked by this automaton satisfies the expiry time constraint. We can compute the
span because we know $t_{cur}$, the time at which this automaton accepted the first event-type, from
the entry for this automaton in the $\alpha.init$ list. If the span of the occurrence tracked is less than the expiry
time, then we increment the frequency and remove all the other active automata of this episode
and then start a new automaton in the start state (lines 52-57). This completes the explanation
of Algorithm 1.
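
The next-state computation described above (lines 20 and 30-37 of the pseudocode) can be sketched in Python as follows. This is an illustrative reading with 0-based indices; the function name `next_state` is hypothetical, and the update of the $waits()$ and $init$ lists is deliberately left out.

```python
def next_state(e, q_cur, w_cur, j):
    """Return (q_nxt, w_nxt) after the automaton accepts its j-th event type.

    e[a][b] == 1 iff node a precedes node b; q_cur and w_cur are the binary
    vectors of accepted and awaited event types (0-based indices).
    """
    n = len(e)
    q_nxt = list(q_cur)
    q_nxt[j] = 1          # node j is now accepted
    w_nxt = list(w_cur)
    w_nxt[j] = 0          # and is no longer awaited
    for i in range(n):
        if e[j][i] == 1:
            # Child i becomes awaited once all of its parents are accepted.
            if all(q_nxt[k] == 1 for k in range(n) if e[k][i] == 1):
                w_nxt[i] = 1
    return q_nxt, w_nxt

# Episode (AB) -> (CD), starting from the start state.
e_beta = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
after_a = next_state(e_beta, [0, 0, 0, 0], [1, 1, 0, 0], 0)
after_ab = next_state(e_beta, after_a[0], after_a[1], 1)
```

After accepting A alone the automaton still waits only for B; after accepting B as well it waits for C and D.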

2 The steps of automata transition and check for an older automaton can be combined and carried out more efficiently. For
ease of explanation we have presented the two steps separately.

In the algorithm discussed above we are implicitly assuming that different events in the data
stream have distinct time stamps. This is because, in the data pass loop (starting on line 15) an
automaton can accept $E_{k+1}$ after accepting $E_k$ in the previous pass through the loop. We now
indicate how one can extend Algorithm 1 to handle data with multiple event types having the same
time-stamp. For such data streams, event-types sharing the same time-stamp must be processed
together. One needs to perform unconditional state transitions of all the relevant automata, till
all event-types occurring at a given time are parsed. The state transition step needs a slight
modification here compared to that of Algorithm 1. Consider an automaton for the episode
$(BC) \rightarrow D$ waiting in its start state $(Q^\alpha_{cur}, W^\alpha_{cur}) = (\emptyset, \{B, C\})$. Suppose we have the event-types
$B$, $C$ and $D$ happening together at a time $t$. Let us denote the set of event-types occurring
at time $t$ by $S$. On processing $S$, the automaton would need to accept both $B$ and $C$ but not $D$, though after
accepting $B$ and $C$ it transits into a state where it waits for $D$. In general, an automaton waiting
for a set of event-types $W^\alpha_{cur}$ just before time $t$ should accept the set of events $S \cap W^\alpha_{cur}$ on
seeing the set of event-types $S$ at time $t$. Accordingly, for the next state, $Q^\alpha_{nxt} = Q^\alpha_{cur} \cup (S \cap W^\alpha_{cur})$.
$W^\alpha_{nxt}$ can be computed from $Q^\alpha_{nxt}$ as in Definition 6. Equivalently, we could do the same by
processing event-type by event-type as in Algorithm 1, but such a strategy needs some extra
caution. Suppose we had $C$ followed by $B$ and finally followed by $D$ in the event stream, but
all with the same associated time $t$. We parse $C$ and move to a state $(\{C\}, \{B\})$. On parsing
$B$, we move to a state $(\{B, C\}, \{D\})$. Now, on processing $D$ next, if we accept it, we move
to $(\{B, C, D\}, \emptyset)$. But $\langle (C, t), (B, t), (D, t) \rangle$ is not a valid occurrence as $D$'s occurrence time
must be strictly greater than that of $C$ and $B$. Hence even though we add $(\alpha, [1\ 1\ 0], [0\ 0\ 1], 3)$ to
$waits(D)$ after seeing $(B, t)$, this potential transition cannot be active at time $t$. The important
thing to note is that this element was freshly added to $waits()$ after we started processing $S$.
Hence, such potential transition information, after being added to $waits()$, must be initially inactive
till all event-types at the current time are parsed. Such $waits()$ elements must be made active
just before parsing event-types of the next time-instant. After performing the state transitions
pertaining to all event types at the current time instant, the rest of the steps are essentially the
same as in Algorithm 1. First, we perform the multiple automata check (there can be more than
two automata in the same state now) and removal of all older automata if necessary. We follow
this by the frequency incrementing step. Since we increment frequency only after parsing all
event-types at a given time, we need to store the automata that reach the final state too during
the state transition step. Finally, using the $bag$ list, we add automata initialised in the start state,
before processing the event-types occurring at the next time tick.
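
The set-based transition for events sharing a time stamp, $Q_{nxt} = Q_{cur} \cup (S \cap W_{cur})$, can be sketched as follows. This is a minimal illustration using Python sets rather than the binary vectors of Algorithm 1; the names `waiting_set` and `step_on_time_tick` and the `parents` dictionary are assumptions made for the example.

```python
def waiting_set(nodes, parents, accepted):
    """Event types not yet accepted whose parents have all been accepted."""
    return {x for x in nodes - accepted if parents[x] <= accepted}

def step_on_time_tick(nodes, parents, accepted, simultaneous):
    """Accept only those simultaneous event types awaited before this tick."""
    awaited = waiting_set(nodes, parents, accepted)
    return accepted | (simultaneous & awaited)

# Episode (BC) -> D with B, C and D all occurring at the same time t.
nodes = {"B", "C", "D"}
parents = {"B": set(), "C": set(), "D": {"B", "C"}}
after_tick = step_on_time_tick(nodes, parents, set(), {"B", "C", "D"})
```

Because the awaited set is computed once, before any event type of the tick is accepted, D is not accepted at time t even though the automaton waits for it afterwards.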
Algorithm 1: CountFrequencyExpiryTime($\mathcal{C}_l$, $D$, $\gamma$, $\mathcal{E}$, $T_X$)

Input: Set $\mathcal{C}_l$ of candidate episodes, event stream $D = \langle (E_1, t_1), \ldots, (E_n, t_n) \rangle$, frequency threshold $\gamma$, set $\mathcal{E}$ of event types (alphabet), expiry time $T_X$
Output: Set $\mathcal{F}_l$ of frequent episodes out of $\mathcal{C}_l$

 1 $\mathcal{F}_l \leftarrow \emptyset$ and $bag \leftarrow \emptyset$;
 2 foreach event type $E \in \mathcal{E}$ do $waits[E] \leftarrow \emptyset$;
 3 /* Initialization of the waits() lists */
 4 foreach $\alpha \in \mathcal{C}_l$ do
 5     $\alpha.freq \leftarrow 0$ and $\alpha.w_{start} \leftarrow \mathbf{0}$;
 6     for $i \leftarrow 1$ to $|\alpha|$ do
 7         $j \leftarrow 1$;
 8         while ($j \le |\alpha|$ and $\alpha.e[j][i] = 0$) do $j \leftarrow j + 1$;
 9         if ($j = |\alpha| + 1$) then $\alpha.w_{start}[i] \leftarrow 1$;
10     for $i \leftarrow 1$ to $|\alpha|$ do
11         if $\alpha.w_{start}[i] = 1$ then
12             Add $(\alpha, \mathbf{0}, \alpha.w_{start}, i)$ to $waits[\alpha.g[i]]$;
13             /* $\mathbf{0}$ is a vector of all zeros */
14 /* Database pass */
15 for $k \leftarrow 1$ to $n$ do
16     /* $n$ is the number of events in the event stream */
17     foreach $(\alpha, q_{cur}, w_{cur}, j) \in waits[E_k]$ do
18         /* $E_k$ - currently processed event-type in the event stream */
19         /* Transit the current automaton to the next state */
20         $q_{nxt} \leftarrow q_{cur}$ and $q_{nxt}[j] \leftarrow 1$;
21         if $q_{cur} = \mathbf{0}$ then
22             Add $(q_{nxt}, t_k)$ to $\alpha.init$ and Add $\alpha$ to $bag$;
23             /* $t_k$ - time associated with the current event in event stream */
24         else
25             if $q_{nxt} \neq \mathbf{1}$ then
26                 /* $\mathbf{1}$ is a vector of all ones */
27                 Update $(q_{cur}, t_{cur})$ in $\alpha.init$ to $(q_{nxt}, t_{cur})$;
28                 /* $t_{cur}$ would be the first state transition time of the current automaton */
29         if ($q_{nxt} \neq \mathbf{1}$) then
30             $w_{nxt} \leftarrow w_{cur}$, $w_{nxt}[j] \leftarrow 0$ and $w_{temp} \leftarrow w_{nxt}$;
31             for $i \leftarrow 1$ to $|\alpha|$ do
32                 if $\alpha.e[j][i] = 1$ then
33                     $flg \leftarrow$ TRUE;
34                     for ($k' \leftarrow 1$; $k' \le |\alpha|$ and $flg =$ TRUE; $k' \leftarrow k' + 1$) do
35                         if $\alpha.e[k'][i] = 1$ and $q_{nxt}[k'] = 0$ then
36                             $flg \leftarrow$ FALSE;
37                     if $flg =$ TRUE then $w_{nxt}[i] \leftarrow 1$;
38             for $i \leftarrow 1$ to $|\alpha|$ do
39                 if $w_{temp}[i] = 1$ then
40                     Replace $(\alpha, q_{cur}, w_{cur}, i)$ from $waits[\alpha.g[i]]$ to $(\alpha, q_{nxt}, w_{nxt}, i)$;
41                 if ($w_{temp}[i] = 0$ and $w_{nxt}[i] = 1$) then
42                     Add $(\alpha, q_{nxt}, w_{nxt}, i)$ to $waits[\alpha.g[i]]$;
43         Remove $(\alpha, q_{cur}, w_{cur}, j)$ from $waits[\alpha.g[j]]$;
44         /* Removing an older automaton if any in the next state */
45         if ($(q_{nxt}, t') \in \alpha.init$ and $t' < t_{cur}$) then
46             /* $t'$ is the first state transition time of an older automaton existing in state $q_{nxt}$ */
47             Remove $(q_{nxt}, t')$ from $\alpha.init$;
48             for $i \leftarrow 1$ to $|\alpha|$ do
49                 if $w_{nxt}[i] = 1$ then
50                     Remove $(\alpha, q_{nxt}, w_{nxt}, i)$ from $waits[\alpha.g[i]]$;
51         /* Increment the frequency */
52         if ($q_{nxt} = \mathbf{1}$ and $(t_k - t_{cur}) < T_X$) then
53             $\alpha.freq \leftarrow \alpha.freq + 1$ and Empty $\alpha.init$ list;
54             for $i \leftarrow 1$ to $|\alpha|$ do
55                 foreach $(\alpha, q, w, i) \in waits[\alpha.g[i]]$ do
56                     Remove $(\alpha, q, w, i)$ from $waits[\alpha.g[i]]$ and Add $\alpha$ to $bag$;
57     /* Add automata initialized in the start state */
58     foreach $\alpha \in bag$ do
59         for $i \leftarrow 1$ to $|\alpha|$ do
60             if $\alpha.w_{start}[i] = 1$ then
61                 Add $(\alpha, \mathbf{0}, \alpha.w_{start}, i)$ to $waits[\alpha.g[i]]$;
62     Empty $bag$;
63 foreach $\alpha \in \mathcal{C}_l$ do if $\alpha.freq > \gamma$ then Add $\alpha$ to $\mathcal{F}_l$;
64 return $\mathcal{F}_l$

A. Space and time complexity of Algorithm 1

The number of automata that may be active (at any time) for each episode is central to the
space and time complexities of Algorithm 1. The number of automata currently active for a
given $l$-node episode, $\alpha$, is one more than the number of elements in the $\alpha.init$ list. We now show
that there can be at most $l$ entries in the $\alpha.init$ list of Algorithm 1. Recall that $(q_j, t_{i_j}) \in \alpha.init$
means that there is an automaton of episode $\alpha$ that is currently active which made its transition
out of the start state at time $t_{i_j}$ and is currently in state $q_j$. Suppose there are $m$ entries in the $\alpha.init$
list, namely, $(q_1, t_{i_1}), \ldots, (q_m, t_{i_m})$, with $t_{i_1} < t_{i_2} < \cdots < t_{i_m}$. Let $\{Q^\alpha_1, \ldots, Q^\alpha_m\}$ represent
the corresponding sets of accepted event-types for these active automata. Consider $k, j$ such that
$1 \le k < j \le m$. The events in the data stream that effected transitions in the $j^{th}$ automaton
(i.e. the automaton which moved out of its start state at $t_{i_j}$) would have also been seen by the $k^{th}$
automaton. If the $k^{th}$ automaton has not already accepted previous events with the same event-types,
it will do so now on seeing the events which effect the transitions of the $j^{th}$ automaton.
Hence, $Q^\alpha_j \subset Q^\alpha_k$ for any $1 \le k < j \le m$. Since $Q^\alpha_k \subseteq X^\alpha$ and $|X^\alpha| = l$, there are at most $l$
(distinct) telescoping subsets of $X^\alpha$, and so, we must have $m \le l$.

The time required for initialization in Algorithm 1 is $O(|\mathcal{E}| + |\mathcal{C}_l| l^2)$. This is because there are
$|\mathcal{E}|$ $waits()$ lists to initialize and it takes $O(l^2)$ time to find the least elements for each of the $|\mathcal{C}_l|$
episodes. For each of the $n$ events in the stream, the corresponding $waits()$ list contains no more
than $l|\mathcal{C}_l|$ elements as there can exist at most $l$ automata per episode. The update corresponding
to each of these entries takes $O(l^2)$ time to find the new elements to be added to the $waits()$
lists. Thus, the time complexity of the data pass is $O(n l^3 |\mathcal{C}_l|)$.

For each automaton, we store its state information in the binary $l$-vectors $q$ and $w$. To be able
to make $|W|$ transitions from a given state, we maintain $|W|$ elements in various $waits()$ lists
with each element ready to accept one of the event-types in $W$. Hence, for each automaton we
require $O(l^2)$ space to store the state and its possible transitions. Since there are $l$ such automata
per episode in the worst case, the space complexity is $O(l^3 |\mathcal{C}_l|)$.

V. Candidate Generation 

Recall that the episode discovery process employs a level-wise procedure where each level
involves the two steps of candidate generation and frequency counting. In Sec. IV we described
the frequency counting algorithms. In this section, we describe the candidate generation algorithm
for injective episodes with general partial orders. The input to the candidate generation algorithm,
at level $(l + 1)$, is the set, $\mathcal{F}_l$, of frequent episodes of size $l$. Under the frequency measure (based
on non-overlapped occurrences), we know that no episode can be more frequent than any of its
subepisodes. The candidate generation step exploits this property to construct the set, $\mathcal{C}_{l+1}$, of
$(l + 1)$-node candidate episodes.

Recall (cf. Sec. III-A) that it is simpler to view an injective episode $\alpha = (V_\alpha, <_\alpha, g_\alpha)$ in terms
of its associated partial order set, $(X^\alpha, R^\alpha)$. Each episode in $\mathcal{F}_l$ is represented by an $l$-element
array of event-types, $\alpha.g$, and an $l \times l$ matrix, $\alpha.e$, containing the adjacency matrix of the partial
order. The array $\alpha.g$ contains exactly the elements of $X^\alpha$ sorted as per the lexicographic ordering
on the alphabet $\mathcal{E}$. We refer to $\alpha.g[i] = g_\alpha(v_i)$ as the $i^{th}$ node of $\alpha$. Note that the index $i$ of a node in an
episode has no relationship whatsoever with the associated partial order $R^\alpha$.

The principal task here is to generate all possible $(l + 1)$-node candidates such that each of
their $l$-node subepisodes is frequent. Each $(l + 1)$-node candidate is generated by combining
two suitable $l$-node frequent episodes (out of $\mathcal{F}_l$). We first explain which pairs of episodes in
$\mathcal{F}_l$ can be combined and then explain how to combine them to get $(l + 1)$-node episodes. Every
pair of $l$-node frequent episodes, $\alpha, \beta \in \mathcal{F}_l$, such that exactly the same $(l - 1)$-node subepisode
is obtained when their respective last nodes are dropped, can be combined to obtain one or more
potential $(l + 1)$-node candidates. Thus, the episodes $(C \rightarrow A \rightarrow B)$ and $(A \rightarrow D \rightarrow B)$ would
be combined since the same subepisode, namely $(A \rightarrow B)$, is obtained by dropping the last nodes
of $(C \rightarrow A \rightarrow B)$ and $(A \rightarrow D \rightarrow B)$, which are $C$ and $D$ respectively. Episodes $(C \rightarrow A \rightarrow B)$
and $((AB) \rightarrow D)$ would not be combined (since different subepisodes, namely $(A \rightarrow B)$ and
$(AB)$, are obtained on dropping the last nodes of $(C \rightarrow A \rightarrow B)$ and $((AB) \rightarrow D)$). For every
such constructed candidate episode $\gamma$, if all its $l$-node subepisodes are frequent (i.e. if they can
all be found in $\mathcal{F}_l$), $\gamma$ is declared a candidate episode and is added to the output set, $\mathcal{C}_{l+1}$. We
can formalize this notion of which pairs of episodes can be combined, as given below.

For an injective episode $\alpha$, let $X^\alpha = \{x^\alpha_1, \ldots, x^\alpha_l\}$ denote the $l$ distinct event-types in $\alpha$,
indexed in lexicographic order. We combine two episodes $\alpha_1$ and $\alpha_2$ such that the following
two conditions hold: (i) $x^{\alpha_1}_i = x^{\alpha_2}_i$, $i = 1, \ldots, (l - 1)$, and $x^{\alpha_1}_l \neq x^{\alpha_2}_l$, and (ii) $R^{\alpha_1}|_{(X^{\alpha_1} \setminus \{x^{\alpha_1}_l\})}
= R^{\alpha_2}|_{(X^{\alpha_2} \setminus \{x^{\alpha_2}_l\})}$ (i.e. the restriction of $R^{\alpha_1}$ to the first $(l - 1)$ nodes of $\alpha_1$ is identical to the
restriction of $R^{\alpha_2}$ to the first $(l - 1)$ nodes of $\alpha_2$). To ensure that the same pair of episodes is
not picked up twice, we follow the convention that $\alpha_1$ and $\alpha_2$ are such that $x^{\alpha_1}_l < x^{\alpha_2}_l$ under
the lexicographic ordering.
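
The two conditions for combining a pair of episodes can be checked with a short Python sketch. Here an episode is represented, purely for illustration, as a set of event types together with a set of ordered pairs; the function names `restrict` and `combinable` are hypothetical.

```python
def restrict(order, keep):
    """Restriction of a partial order (set of pairs) to the event types in keep."""
    return {(a, b) for (a, b) in order if a in keep and b in keep}

def combinable(nodes1, order1, nodes2, order2):
    """Check conditions (i) and (ii), plus the convention on the last nodes."""
    g1, g2 = sorted(nodes1), sorted(nodes2)
    # (i) same first l-1 event types, and last node of the first episode
    # lexicographically smaller than that of the second.
    if len(g1) != len(g2) or g1[:-1] != g2[:-1] or not g1[-1] < g2[-1]:
        return False
    # (ii) identical partial order on the shared first l-1 event types.
    common = set(g1[:-1])
    return restrict(order1, common) == restrict(order2, common)

# (C -> A -> B) and (A -> D -> B) share (A -> B) after dropping C and D.
serial_cab = ({"A", "B", "C"}, {("C", "A"), ("A", "B"), ("C", "B")})
serial_adb = ({"A", "B", "D"}, {("A", "D"), ("D", "B"), ("A", "B")})
# ((AB) -> D) leaves the parallel episode (AB) after dropping D.
ab_then_d = ({"A", "B", "D"}, {("A", "D"), ("B", "D")})
```

With these inputs, `combinable(*serial_cab, *serial_adb)` holds, while the pair `serial_cab`, `ab_then_d` is rejected because the restrictions to {A, B} differ.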

We first illustrate the process of constructing potential candidates through some examples.
Each pair of episodes $\alpha_1$ and $\alpha_2$ sharing the same $(l - 1)$-node subepisode on dropping their
respective last nodes can lead to a maximum of three potential candidates, denoted by $\mathcal{Y}_0$, $\mathcal{Y}_1$ and
$\mathcal{Y}_2$. Consider the $\alpha_1$ and $\alpha_2$ of Fig. 2. We construct $\mathcal{Y}_0$ as a simple union of $\alpha_1$ and $\alpha_2$, i.e. we set
$X^{\mathcal{Y}_0} = X^{\alpha_1} \cup X^{\alpha_2}$ and $R^{\mathcal{Y}_0} = R^{\alpha_1} \cup R^{\alpha_2}$. As it turns out, in this example, $R^{\mathcal{Y}_0}$ is a valid partial
order over $X^{\mathcal{Y}_0}$ (satisfying both antisymmetry and transitive closure) and hence, $\mathcal{Y}_0$ is a
valid injective episode (and a potential 5-node candidate). There is no edge in $\mathcal{Y}_0$ between the
last two nodes (i.e. the nodes corresponding to event-types $D$ and $E$ respectively). By adding
an edge from $D$ to $E$ we get another valid partial order with the relation $R^{\mathcal{Y}_0} \cup \{(D, E)\}$, and
this corresponds to a second injective candidate, $\mathcal{Y}_1$, that we can construct using the $\alpha_1$ and $\alpha_2$
of Fig. 2. Similarly, $R^{\mathcal{Y}_0} \cup \{(E, D)\}$ corresponds to a valid partial order and this gives us a third
potential candidate, $\mathcal{Y}_2$, from the same $\alpha_1$ and $\alpha_2$. But not all pairs of episodes can be combined in
this manner to construct three different potential candidates. For example, for the $\alpha_1$ and $\alpha_2$ of
Fig. 3, $\mathcal{Y}_1$ is the only potential candidate. While $(X^{\mathcal{Y}_1}, R^{\mathcal{Y}_1})$ obeys transitive closure, $(X^{\mathcal{Y}_0}, R^{\mathcal{Y}_0})$
is not transitively closed because $(D, C)$ and $(C, E)$ belong to $R^{\mathcal{Y}_0}$, but $(D, E)$ does not. For
the same reason $(X^{\mathcal{Y}_2}, R^{\mathcal{Y}_2})$ is not transitively closed either. In the example of Fig. 4, $\mathcal{Y}_0$ and
$\mathcal{Y}_1$ are potential candidates (but $\mathcal{Y}_2$ is not a valid potential candidate because $(B, E)$ and $(E, D)$
are in $R^{\mathcal{Y}_2}$, while $(B, D)$ is not).

Fig. 2. All combinations $\mathcal{Y}_0$, $\mathcal{Y}_1$ and $\mathcal{Y}_2$ come up.

Thus, the general strategy for combining an episode $\alpha_1$ with a valid $\alpha_2$, satisfying the two
conditions mentioned before, is as follows. We attempt to construct an $(l + 1)$-node candidate
from $\alpha_1$ and $\alpha_2$ by appending the last node of $\alpha_2$ to the last node of $\alpha_1$. There are three
possibilities to consider for combining $\alpha_1$ and $\alpha_2$:

$X^{\mathcal{Y}_0} = X^{\alpha_1} \cup X^{\alpha_2}, \quad R^{\mathcal{Y}_0} = R^{\alpha_1} \cup R^{\alpha_2}$    (7)

$X^{\mathcal{Y}_1} = X^{\alpha_1} \cup X^{\alpha_2}, \quad R^{\mathcal{Y}_1} = R^{\mathcal{Y}_0} \cup \{(x^{\alpha_1}_l, x^{\alpha_2}_l)\}$    (8)

$X^{\mathcal{Y}_2} = X^{\alpha_1} \cup X^{\alpha_2}, \quad R^{\mathcal{Y}_2} = R^{\mathcal{Y}_0} \cup \{(x^{\alpha_2}_l, x^{\alpha_1}_l)\}$    (9)

In each case, if $R^{\mathcal{Y}_j}$ is a valid partial order over $X^{\mathcal{Y}_j}$, then the $(l + 1)$-node (injective) episode,
$(X^{\mathcal{Y}_j}, R^{\mathcal{Y}_j})$, is considered as a potential candidate. To verify this, we need to check for the
antisymmetry and transitive closure of the above three possibilities. One can show that each $R^{\mathcal{Y}_j}$
satisfies antisymmetry because $\alpha_1$ and $\alpha_2$ share the same $(l - 1)$-node subepisode on dropping their
last nodes. To check for transitive closure of $(X^{\mathcal{Y}_j}, R^{\mathcal{Y}_j})$ we would need to ensure that for every
triple $z_1, z_2, z_3 \in X^{\mathcal{Y}_j}$, if $(z_1, z_2) \in R^{\mathcal{Y}_j}$ and $(z_2, z_3) \in R^{\mathcal{Y}_j}$, then we must have $(z_1, z_3) \in R^{\mathcal{Y}_j}$.
However, since $(R^{\alpha_1} \cup R^{\alpha_2}) \subseteq R^{\mathcal{Y}_j}$ and since $(X^{\alpha_1}, R^{\alpha_1})$ and $(X^{\alpha_2}, R^{\alpha_2})$ are already known
to be transitively closed, we need to perform the transitive closure check only for all size-3
subsets of $X^{\mathcal{Y}_j}$ that are of the form $\{x^{\alpha_1}_l, x^{\alpha_2}_l, x^{\alpha_1}_i : 1 \le i \le (l - 1)\}$. Hence, the transitive
closure check is $O(l)$. Finally, if all the $l$-node subepisodes of $\mathcal{Y}_j$ can be found in $\mathcal{F}_l$ then $\mathcal{Y}_j$ is
added to the final candidate list, $\mathcal{C}_{l+1}$, that is output by the algorithm.
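
The three possibilities (7)-(9) and the validity check can be sketched in Python as follows. For simplicity this sketch uses a brute-force transitivity check over all pairs of edges instead of the restricted $O(l)$ check described above, and it assumes the two input episodes already satisfy the combining conditions. The names `is_transitive` and `potential_candidates` are hypothetical.

```python
def is_transitive(order):
    """True iff (a, b) and (b, d) in order always implies (a, d) in order."""
    return all((a, d) in order
               for (a, b) in order
               for (c, d) in order if b == c)

def potential_candidates(nodes1, order1, nodes2, order2):
    """Build Y0, Y1, Y2 from a combinable pair and keep the valid ones."""
    last1, last2 = max(nodes1), max(nodes2)  # lexicographically last event types
    nodes = set(nodes1) | set(nodes2)
    y0 = set(order1) | set(order2)
    options = [y0, y0 | {(last1, last2)}, y0 | {(last2, last1)}]
    return [(nodes, r) for r in options if is_transitive(r)]

# D -> C combined with C -> E: only the version with the edge (D, E) survives.
only_y1 = potential_candidates({"C", "D"}, {("D", "C")},
                               {"C", "E"}, {("C", "E")})
# Two 1-node episodes give a parallel episode and two serial episodes.
from_singletons = potential_candidates({"A"}, set(), {"C"}, set())
```

The first call mirrors the situation of Fig. 3 on a smaller pair of episodes; the second reproduces the level-1 case in which every pair of 1-node episodes yields three candidates.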

Fig. 3. Edges $(D, C)$ and $(C, E)$ prevent $\mathcal{Y}_0$ and $\mathcal{Y}_2$ from coming up as candidates.

Fig. 4. All nodes $A$, $B$ and $C$ prevent $\mathcal{Y}_2$ from coming up.

Interestingly, one need not check whether all the $l$-node subepisodes of a potential $(l + 1)$-node
candidate $\mathcal{Y}_j$ are in $\mathcal{F}_l$. The number of such subepisodes can in general be very large. It
is enough to check whether all the $l$-node subepisodes obtained by restricting $R^{\mathcal{Y}_j}$ to an $l$-node
subset of $X^{\mathcal{Y}_j}$ are present in $\mathcal{F}_l$. For example, consider a 3-node episode $(X^\alpha = \{A, B, C\}, R^\alpha =
\{(A, B), (A, C)\})$. Its 2-node subepisodes are the serial episodes $A \rightarrow B$ and $A \rightarrow C$, and
parallel episodes $(AB)$, $(AC)$ and $(BC)$. So in general, an $(l + 1)$-node injective episode has
more than $(l + 1)$ $l$-node subepisodes. Let us consider those $l$-node subepisodes of $(X^\alpha, R^\alpha)$
which are obtained by restricting $R^\alpha$ to an $l$-subset of $X^\alpha$. We can have $(l + 1)$ such subepisodes.
In this example, $A \rightarrow B$, $A \rightarrow C$ and $(BC)$ are the three 2-node subepisodes of $\alpha$ obtained
by restricting $R^\alpha$ to all the possible 2-element subsets of $X^\alpha$. Note that the remaining 2-node
subepisodes of $A \rightarrow (BC)$, namely $(AB)$ and $(AC)$, are also subepisodes of one or the other
of these three 2-node subepisodes. For any $N$-node episode $\alpha$, let us denote by $\mathcal{M}^\alpha_k$ the set
of all $k$-node subepisodes ($k < N$) obtained by restriction of $R^\alpha$ to $k$-subsets of $X^\alpha$. We note
the following. For every $k$-node subepisode $\gamma$ of $\alpha$, there exists a $\beta \in \mathcal{M}^\alpha_k$ such that $\gamma$ is
a subepisode of $\beta$. Also, for every $\beta \in \mathcal{M}^\alpha_k$, there exists no other $\delta \in \mathcal{M}^\alpha_k$ such that $\beta$ is a
subepisode of $\delta$. Hence, each $\beta \in \mathcal{M}^\alpha_k$ is maximal in the set of all $k$-node subepisodes of $\alpha$. Therefore
in the rest of the paper, we refer to subepisodes obtained by dropping one or more nodes (see footnote 3) as
maximal subepisodes. Hence, if all the maximal $l$-node subepisodes of a potential $(l + 1)$-node
candidate are frequent, then all its $l$-node subepisodes must also be frequent, which means it is
enough to check if all the $l$-node maximal subepisodes of a potential candidate are frequent.

3 We refer to a subepisode (of an episode $\alpha$) obtained by restricting $R^\alpha$ to a strict subset of $X^\alpha$ as a subepisode obtained by
dropping one or more nodes.

Algorithm 2: GenerateCandidates($\mathcal{F}_l$)

Input: Sorted array, $\mathcal{F}_l$, of frequent episodes of size $l$
Output: Sorted array, $\mathcal{C}_{l+1}$, of candidates of size $(l + 1)$

 1 Initialize $\mathcal{C}_{l+1} \leftarrow \emptyset$ and $k \leftarrow 0$;
 2 if $l = 1$ then
 3     for $h \leftarrow 1$ to $|\mathcal{F}_l|$ do $\mathcal{F}_l[h].blockstart \leftarrow 1$;
 4 for $i \leftarrow 1$ to $|\mathcal{F}_l|$ do
 5     $currentblockstart \leftarrow k + 1$;
 6     for ($j \leftarrow i + 1$; $\mathcal{F}_l[j].blockstart = \mathcal{F}_l[i].blockstart$; $j \leftarrow j + 1$) do
 7         if $\mathcal{F}_l[i].g[l] \neq \mathcal{F}_l[j].g[l]$ then
 8             $\mathcal{P} \leftarrow$ GetPotentialCandidates($\mathcal{F}_l[i]$, $\mathcal{F}_l[j]$);
 9             foreach $\alpha \in \mathcal{P}$ do
10                 $flg \leftarrow$ TRUE;
11                 for ($r \leftarrow 1$; $r < l$ and $flg =$ TRUE; $r \leftarrow r + 1$) do
12                     for $x \leftarrow 1$ to $r - 1$ do
13                         Set $\beta.g[x] = \alpha.g[x]$;
14                         for $z \leftarrow 1$ to $r - 1$ do $\beta.e[x][z] \leftarrow \alpha.e[x][z]$;
15                         for $z \leftarrow r$ to $l$ do $\beta.e[x][z] \leftarrow \alpha.e[x][z + 1]$;
16                     for $x \leftarrow r$ to $l$ do
17                         $\beta.g[x] \leftarrow \alpha.g[x + 1]$;
18                         for $z \leftarrow 1$ to $r - 1$ do $\beta.e[x][z] \leftarrow \alpha.e[x + 1][z]$;
19                         for $z \leftarrow r$ to $l$ do $\beta.e[x][z] \leftarrow \alpha.e[x + 1][z + 1]$;
20                     if $\beta \notin \mathcal{F}_l$ then $flg \leftarrow$ FALSE;
21                 if $flg =$ TRUE then
22                     $k \leftarrow k + 1$;
23                     Add $\alpha$ to $\mathcal{C}_{l+1}$;
24                     $\mathcal{C}_{l+1}[k].blockstart \leftarrow currentblockstart$;
25 return $\mathcal{C}_{l+1}$
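
The check on maximal subepisodes (dropping one node at a time, as in lines 11-20 of Algorithm 2) can be sketched in Python as follows. Unlike Algorithm 2, this illustration checks all $(l + 1)$ maximal subepisodes, including the two obtained by dropping the last and last-but-one nodes, and it represents an episode as a pair of sets rather than arrays. The function names are hypothetical.

```python
def maximal_subepisodes(nodes, order):
    """All subepisodes obtained by dropping exactly one node."""
    result = []
    for dropped in sorted(nodes):
        keep = set(nodes) - {dropped}
        restricted = {(a, b) for (a, b) in order if a in keep and b in keep}
        result.append((keep, restricted))
    return result

def all_maximal_frequent(nodes, order, frequent):
    """True iff every maximal subepisode appears in the list of frequent episodes."""
    return all(sub in frequent for sub in maximal_subepisodes(nodes, order))

# The 3-node episode A -> (BC) from the text.
subs = maximal_subepisodes({"A", "B", "C"}, {("A", "B"), ("A", "C")})
```

For $A \rightarrow (BC)$ this yields the parallel episode $(BC)$ and the serial episodes $A \rightarrow C$ and $A \rightarrow B$, matching the three subepisodes listed in the text.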

For a given frequent episode $\alpha_1$, we now describe how one can efficiently search for all other
combinable frequent episodes of the same size. At level 1 (i.e. $l = 1$), we ensure that $\mathcal{F}_1$ is
ordered according to the lexicographic ordering on the set of event types $\mathcal{E}$. Let $\mathcal{F}_l[i]$ denote
the $i^{th}$ episode in the collection, $\mathcal{F}_l$, of $l$-node frequent episodes. Suppose $\mathcal{F}_1$ consists of the
frequent episodes $A$, $C$ and $E$; then we have $\mathcal{F}_1[1] = A$, $\mathcal{F}_1[2] = C$ and $\mathcal{F}_1[3] = E$. All
the three 1-node episodes share the same subepisode $\emptyset$ on dropping their last event. As per
the candidate generation algorithm, any two 1-node episodes are combined to form a parallel
episode and two serial episodes. Accordingly here, episode $A$ is combined with $C$ and $E$ to form
6 candidates in $\mathcal{C}_2$. Similarly, $C$ is combined with $E$ to add three more candidates to $\mathcal{C}_2$. Note
that the first 6 candidates share the same 1-node subepisode $A$ on dropping their last event. Also,
the next three candidates share the same 1-node subepisode $C$ on dropping their last event. The
candidate generation procedure adopted at each level is such that the episodes which share
the same subepisode on dropping their last events appear consecutively in the generated list of
candidates. We refer to such a maximal set of episodes as a block. In addition,
we maintain the episodes in each block so that they are ordered lexicographically with respect
to the array of event types. Since the block information helps us decide efficiently which
episodes to combine, we store the block information at each level, right from level one. At
level 1, all episodes belong to a single block. For a given $\alpha_1 \in \mathcal{F}_l$, the set of all valid episodes
$\alpha_2$ (satisfying the conditions explained before) with which $\alpha_1$ can be combined consists of all those
episodes placed below $\alpha_1$ (except the ones which share the same set of event types with $\alpha_1$) in
the same block. All candidate episodes obtained by combining a given $\alpha_1$ with all permissible
episodes ($\alpha_2$) below it in the same block of $\mathcal{F}_l$ will give rise to a block of episodes in $\mathcal{C}_{l+1}$, each
of them having $\alpha_1$ as their common $l$-node subepisode on dropping their last nodes. Hence, the
block information of $\mathcal{C}_{l+1}$ can be naturally obtained during its construction itself. Even though
the episodes within each block are sorted in lexicographic order of their respective arrays of
event-types, we point out that the full $\mathcal{F}_l$ does not obey the lexicographic ordering based on the
arrays of event-types. For example, the episodes $((AB) \rightarrow C)$ and $(A \rightarrow (BC))$ both have the
same array of event-types, but would appear in different blocks (with, for example, an episode
like $((AB) \rightarrow D)$ appearing in the same block as $((AB) \rightarrow C)$, while $(A \rightarrow (BC))$, since it
belongs to a different block, may appear later in $\mathcal{F}_l$).

The pseudocode for the candidate generation procedure, GenerateCandidates(), is listed
in Algorithm 2. The input to Algorithm 2 is a collection, $\mathcal{F}_l$, of $l$-node frequent episodes (where
$\mathcal{F}_l[i]$ denotes the $i$th episode in the collection). The episodes in $\mathcal{F}_l$ are organized in
blocks, and episodes within each block appear in lexicographic order with respect to the array
of event types. We use an array $\mathcal{F}_l.blockstart$ to store the block information of every episode$^{4}$:
$\mathcal{F}_l.blockstart[i]$ holds a value $k$ such that $\mathcal{F}_l[k]$ is the first element of the block to which
$\mathcal{F}_l[i]$ belongs. The output of the algorithm is the collection, $\mathcal{C}_{l+1}$, of candidate episodes of size
$(l + 1)$. Initially, $\mathcal{C}_{l+1}$ is empty and, if $l = 1$, all (1-node) episodes are assigned to the same block
(lines 1-3, Algorithm 2). The main loop is over the episodes in $\mathcal{F}_l$ (starting on line 4, Algorithm 2).
The algorithm tries to combine each episode, $\mathcal{F}_l[i]$, with the episodes in the same block as $\mathcal{F}_l[i]$
that come after it (line 6, Algorithm 2). In the notation used earlier to describe the procedure,
we can think of $\mathcal{F}_l[i]$ as $\alpha_1$ and $\mathcal{F}_l[j]$ as $\alpha_2$. If $\mathcal{F}_l[i]$ and $\mathcal{F}_l[j]$ have identical event-types, we do
not combine them (line 7, Algorithm 2). The GetPotentialCandidates() function takes
$\mathcal{F}_l[i]$ and $\mathcal{F}_l[j]$ as input and returns the set, $\mathcal{P}$, of potential candidates corresponding to them
(line 8, Algorithm 2). This function first generates the three potential candidates by combining
$\mathcal{F}_l[i]$ and $\mathcal{F}_l[j]$ as described in equations (7), (8) and (9). For each of the three possibilities,
it then does a transitive closure check to ascertain their validity as partial orders$^{5}$. For each
potential candidate, $\alpha \in \mathcal{P}$, we construct its $l$-node (maximal) subepisodes (denoted as $\beta$ in the
pseudocode) by dropping one node at a time from $\alpha$ (lines 13-19, Algorithm 2). Note that there
is no need to check the case of dropping the last and last-but-one nodes of $\alpha$, since they would
result in the subepisodes $\mathcal{F}_l[i]$ and $\mathcal{F}_l[j]$, which are already known to be frequent. If all $l$-node
maximal subepisodes of $\alpha$ are found to be frequent, then $\alpha$ is added to $\mathcal{C}_{l+1}$, and its block
information is suitably updated (lines 20-24, Algorithm 2).
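
The combine-and-check logic described above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 2: it ignores the block structure and simply tries all ordered pairs, and the episode representation (a sorted tuple of event types plus a transitively closed set of edges) as well as the names `restrict`, `potential_candidates` and `generate_candidates` are assumptions made for illustration.

```python
def restrict(edges, nodes):
    """Restriction of a relation to a subset of nodes."""
    return frozenset((a, b) for (a, b) in edges if a in nodes and b in nodes)


def is_transitive(edges):
    """True if (a, b) and (b, d) in edges always implies (a, d) in edges."""
    return all((a, d) in edges
               for (a, b) in edges for (c, d) in edges if b == c)


def potential_candidates(ep1, ep2):
    """Up to three (l+1)-node combinations of two combinable l-node episodes.

    An episode is (events, edges): a lexicographically sorted tuple of event
    types and a transitively closed frozenset of (earlier, later) pairs.
    """
    (x1, r1), (x2, r2) = ep1, ep2
    # Combinable only if the first l-1 event types agree and the last event
    # type of ep1 strictly precedes that of ep2 ...
    if x1[:-1] != x2[:-1] or not x1[-1] < x2[-1]:
        return []
    # ... and both orders agree on the shared first l-1 nodes.
    common = set(x1[:-1])
    if restrict(r1, common) != restrict(r2, common):
        return []
    events = x1 + (x2[-1],)
    base = r1 | r2
    a, b = x1[-1], x2[-1]
    options = [base, base | {(a, b)}, base | {(b, a)}]
    # Keep only the combinations that are valid (transitive) partial orders.
    return [(events, frozenset(r)) for r in options if is_transitive(r)]


def generate_candidates(frequent):
    """Candidates of size l+1 whose l-node maximal subepisodes are all frequent."""
    freq_set = set(frequent)
    out = []
    for ep1 in frequent:
        for ep2 in frequent:
            for events, rel in potential_candidates(ep1, ep2):
                subs = ((tuple(e for e in events if e != drop),
                         restrict(rel, set(events) - {drop}))
                        for drop in events)
                if all(s in freq_set for s in subs):
                    out.append((events, rel))
    return out
```

For two frequent 1-node episodes A and B this sketch yields the three 2-node candidates (the parallel episode and both serial orders). Unlike the procedure in the text, it rechecks the two subepisodes that are already known to be frequent, which is harmless but redundant.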

$^{4}$ A similar array for storing block information is used for parallel and serial episode candidate generation in [12].
$^{5}$ As explained before, one only needs to do a transitivity check on size-3 subsets of the form $\{x_l^{\alpha_1}, x_l^{\alpha_2}, x_i^{\alpha_1}\}$, $1 \le i \le (l-1)$,
separately on the three possibilities. Actually we can save time in the transitivity check further. As explained in the appendix, we
need to generate only the $\mathcal{Y}_0$ combination and perform some special checks on its nodes to decide the valid partial orders to be
generated in $\mathcal{P}$.

A. Correctness of Candidate Generation

In this section, we address two important questions regarding the candidate generation. The 
first question is whether a given partial order is generated more than once in the algorithm. The 
second question is about whether every frequent episode is generated by our candidate generation 
scheme. 

We now address the first question in detail. It is easy to see from equations (7) to (9) that the
partial orders generated from a given pair $(\alpha_1, \alpha_2)$ of $l$-node episodes are all different. Hence we
need to consider whether the same candidate is generated from two different pairs of episodes.

Suppose exactly the same candidate is generated from two different pairs $(\alpha_1, \alpha_2)$ and $(\alpha'_1, \alpha'_2)$.
Call them $\mathcal{Y}_r$ and $\mathcal{Y}'_s$, where $r$ and $s$ vary from 0 to 2 depending on the type of combination of the
episode pairs. First consider the case when both these candidates come up as $\mathcal{Y}_0$ and $\mathcal{Y}'_0$. Note
that $\mathcal{Y}_0 = (X^{\mathcal{Y}_0}, R^{\mathcal{Y}_0}) = (X^{\alpha_1} \cup X^{\alpha_2}, R^{\alpha_1} \cup R^{\alpha_2})$ and $\mathcal{Y}'_0 = (X^{\mathcal{Y}'_0}, R^{\mathcal{Y}'_0}) = (X^{\alpha'_1} \cup X^{\alpha'_2}, R^{\alpha'_1} \cup R^{\alpha'_2})$.
Since the candidates are the same, $\mathcal{Y}_0 = \mathcal{Y}'_0$. This implies (i) $X^{\mathcal{Y}_0} = X^{\mathcal{Y}'_0}$ and (ii) $R^{\mathcal{Y}_0} = R^{\mathcal{Y}'_0}$. (i)
implies $X^{\alpha_1} \cup X^{\alpha_2} = X^{\alpha'_1} \cup X^{\alpha'_2}$. Recall from the conditions for forming candidates that $X^{\alpha_1} \cup X^{\alpha_2}$
$= X^{\alpha_1} \cup \{x_l^{\alpha_2}\} = \{x_1^{\alpha_1}, \ldots, x_l^{\alpha_1}, x_l^{\alpha_2}\}$. Recall that $x_i^{\alpha_1}$ is the $i$th element of $X^{\alpha_1}$, $x_l^{\alpha_2}$ is the $l$th
element of $X^{\alpha_2}$ and $x_l^{\alpha_1} < x_l^{\alpha_2}$, all as per the lexicographic ordering on $\mathcal{E}$. Hence, $x_i^{\alpha_1}$ is the $i$th
element of $X^{\mathcal{Y}_0}$ for $i = 1, \ldots, l$ and $x_l^{\alpha_2}$ is its $(l+1)$th element. An analogous thing holds for
$X^{\mathcal{Y}'_0}$. Since $X^{\mathcal{Y}_0}$ and $X^{\mathcal{Y}'_0}$ are the same, their $i$th elements must also match. This means $x_i^{\alpha_1} = x_i^{\alpha'_1}$
for $i = 1, \ldots, l$ and $x_l^{\alpha_2} = x_l^{\alpha'_2}$. This immediately implies $X^{\alpha_1} = X^{\alpha'_1}$. Also, from the conditions
of generating candidates we have $x_i^{\alpha_1} = x_i^{\alpha_2}$ and $x_i^{\alpha'_1} = x_i^{\alpha'_2}$ for $i = 1, \ldots, (l-1)$. This together
with $x_i^{\alpha_1} = x_i^{\alpha'_1}$ for $i = 1, \ldots, l$ implies $x_i^{\alpha_2} = x_i^{\alpha'_2}$ for $i = 1, \ldots, (l-1)$. Finally, combining this
with $x_l^{\alpha_2} = x_l^{\alpha'_2}$, we have $X^{\alpha_2} = X^{\alpha'_2}$. Thus $X^{\mathcal{Y}_0} = X^{\mathcal{Y}'_0} \implies X^{\alpha_1} = X^{\alpha'_1}$ and $X^{\alpha_2} = X^{\alpha'_2}$.

Since the pairs $(\alpha_1, \alpha_2)$ and $(\alpha'_1, \alpha'_2)$ are to be distinct, we need to have either $R^{\alpha_2} \neq R^{\alpha'_2}$
or $R^{\alpha_1} \neq R^{\alpha'_1}$. We now show that this cannot be the case with $R^{\mathcal{Y}_0} = R^{\mathcal{Y}'_0}$. Suppose either
$R^{\alpha_2} \neq R^{\alpha'_2}$ or $R^{\alpha_1} \neq R^{\alpha'_1}$. Without loss of generality assume $R^{\alpha_1} \neq R^{\alpha'_1}$. This means
(since $X^{\alpha_1} = X^{\alpha'_1}$) there is an edge $(x, y)$ in $R^{\alpha_1}$ that is absent in $R^{\alpha'_1}$ (or the other way round).
Again without loss of generality assume there is an edge $(x, y)$ in $R^{\alpha_1}$ that is absent in $R^{\alpha'_1}$.
Thus, if $R^{\mathcal{Y}_0} = R^{\mathcal{Y}'_0}$, we must have the edge $(x, y)$ in $R^{\alpha'_2}$. By the conditions for candidate
generation, $R^{\mathcal{Y}'_0}$ can be viewed as the disjoint union of $R^{\alpha'_1}$ and $E_2$, where $E_2$ is the set of all
edges in $R^{\alpha'_2}$ involving $x_l^{\alpha'_2}$. This is because the restriction of $R^{\alpha'_1}$ to the first $(l-1)$ nodes of
$\alpha'_1$ is identical to the restriction of $R^{\alpha'_2}$ to the first $(l-1)$ nodes of $\alpha'_2$. $(x, y)$ cannot belong to
$E_2$ as neither $x$ nor $y$ can be $x_l^{\alpha'_2}$ (because $x_l^{\alpha'_2}$ does not belong to $X^{\alpha'_1}$, which is the same as $X^{\alpha_1}$,
which contains both $x$ and $y$). Therefore the edge $(x, y) \in R^{\alpha_1}$, and hence in $R^{\mathcal{Y}_0}$, cannot appear
in $R^{\mathcal{Y}'_0}$. This contradicts (ii) and hence $R^{\alpha_1} = R^{\alpha'_1}$ and $R^{\alpha_2} = R^{\alpha'_2}$. This means that the pairs
$(\alpha_1, \alpha_2)$ and $(\alpha'_1, \alpha'_2)$ that we started off with cannot be distinct.

By arguments similar to the $r = s = 0$ case, we can show that no $\mathcal{Y}_r$ can be equal to any $\mathcal{Y}'_s$.
Hence we have shown that every candidate partial order is uniquely generated. We next show
that every frequent episode would be in the set of candidates output by Algorithm 2.

We show this by induction on the size of the episode. At level one, the set of candidates
contains all the one-node episodes and hence contains all the frequent one-node episodes. Now
suppose at level $l$, all frequent episodes of size $l$ are indeed generated. If an $(l+1)$-node
episode $\alpha = (X, R)$ is frequent, then all its subepisodes are frequent. The maximal $l$-node
subepisodes $(X \setminus \{x_{l+1}\}, R|_{X \setminus \{x_{l+1}\}})$ and $(X \setminus \{x_l\}, R|_{X \setminus \{x_l\}})$ in particular are also frequent and
hence generated at level $l$ (as per the induction hypothesis). Note that the $(l-1)$-node subepisodes
obtained by dropping the last event-types of these two episodes are the same. Hence, the candidate
generation method combines these 2 frequent subepisodes in at most 3 ways. Since $(X, R)$ is
either a $\mathcal{Y}_0$, $\mathcal{Y}_1$ or $\mathcal{Y}_2$ combination of these 2 episodes and also a valid partial order, Algorithm
2 generates it after the first step of candidate generation. The second step checks whether all
its remaining maximal $l$-node subepisodes are also frequent. This condition holds because all
subepisodes of a frequent episode are frequent and, by the induction hypothesis, are generated at
level $l$; $\alpha$ is therefore generated in the list of candidates at level $l+1$. Thus
we can see that our candidate generation algorithm outputs all valid candidates without any
repetition.

B. Candidate Generation with structural constraints 

The candidate generation scheme described above is very flexible. In particular, we can easily
specialize it so that we generate only parallel episodes or only serial episodes. For example,
suppose that for every pair of combinable episodes we generate only the $\mathcal{Y}_0$ combination (and
do not consider the $\mathcal{Y}_1$ and $\mathcal{Y}_2$ combinations). Since for all level one episodes, $X^{\alpha}$ is a singleton
and $R^{\alpha}$ is empty, if we do our $\mathcal{Y}_0$ combination, then $R^{\alpha}$ will be empty for all level-2 candidates.
Now, since we use our $\mathcal{Y}_0$ combination throughout, it is easy to see that $R^{\alpha}$ would remain empty
at all levels and then we will be generating only parallel episodes. Similarly, it is easy to see
that if we do only $\mathcal{Y}_1$ and $\mathcal{Y}_2$ combinations (and do not consider the $\mathcal{Y}_0$ combination) at all levels
then we would generate only serial episodes. Thus the method we presented for mining general
partial orders is easily specialized to a method for parallel episodes or serial episodes only. In
addition, we can also specialize it to mine for certain classes of partial orders as explained below.

Our candidate generation algorithm is easily specialized to any class of partial orders in which,
for every partial order belonging to the class, all its maximal subepisodes also lie in the same
class. We refer to such a class as satisfying a maximal
subepisode property. For example, both the class of serial episodes and the class of parallel episodes satisfy
this property. To mine in a specific class of partial orders, one just needs to do an additional
check and retain only those of the generated potential candidates that belong to the class of
interest. For mining in the space of either serial or parallel episodes, one need not perform this
explicit check of whether the generated candidates belong to the concerned class. Instead, the more
efficient way described earlier can be adopted.

We discuss a few interesting classes of partial orders satisfying the maximal subepisode
property. The first of them is the set of all partial orders where the length of the largest maximal
path of each partial order (denoted as $L_{max}$) is bounded above by a user-defined threshold.
Consider the episode $\alpha = (A \rightarrow ((F)\,(B \rightarrow (C\,D) \rightarrow E)))$. It has three maximal paths, namely
$A \rightarrow B \rightarrow C \rightarrow E$, $A \rightarrow B \rightarrow D \rightarrow E$ and $A \rightarrow F$, and the length of its largest maximal path
is 3. For $L_{max} = 0$, we get the set of all parallel episodes because any $N$-node parallel episode
has $N$ maximal paths each of length 0, and every non-parallel episode has at least one maximal
path of length at least 1. In general, for $L_{max} \le k$, the corresponding class of partial orders contains all
parallel episodes, serial episodes of length less than $(k+1)$ and many more partial orders all of
whose maximal paths have length less than $(k+1)$. It is easy to see that for any partial order
belonging to such a class, all its subepisodes too belong to the same class. As $k$ is increased, the
class of partial orders expands into the space of all partial orders from the parallel episode end.
Another class of partial orders of interest could be one where the number of maximal paths in
each partial order (denoted as $N_{max}$) is bounded above by a threshold. When $N_{max} \le 1$, the
class obtained is exactly the set of serial episodes. For any partial order belonging to
this class, only its maximal subepisodes are guaranteed to belong to the same class. For example,
consider a serial episode $A \rightarrow B \rightarrow C$. All its maximal subepisodes are serial episodes. Its
non-maximal subepisodes like $(A\,B)$ do not belong to the set of serial episodes.

As and when the candidates are generated, we calculate and check whether their $L_{max}$ or $N_{max}$
values satisfy the bound constraint. We use the standard dynamic programming based algorithms
to calculate $L_{max}$ or $N_{max}$ on the transitively reduced graph of each generated candidate partial
order. We could also work on a class of partial orders characterised by an upper bound on both
$L_{max}$ and $N_{max}$, as such a class would also satisfy the maximal subepisode property. Mining
with structural constraints can make the discovery process more efficient as compared to mining
in the class of all injective partial orders. We illustrate with simulation how one can mine for
partial orders with an upper bound constraint on $L_{max}$ or $N_{max}$.
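
As an illustration of the dynamic programming mentioned above, the following Python sketch computes $L_{max}$ and $N_{max}$ on the transitively reduced graph of a partial order. The function names and the edge-set representation are assumptions for illustration, not the paper's implementation.

```python
from functools import lru_cache


def transitive_closure(edges):
    """Smallest transitive relation containing the given edges."""
    rel = set(edges)
    while True:
        new = {(a, d) for (a, b) in rel for (c, d) in rel if b == c} - rel
        if not new:
            return rel
        rel |= new


def transitive_reduction(nodes, rel):
    """Covering edges of a transitively closed, acyclic relation."""
    return {(a, b) for (a, b) in rel
            if not any((a, c) in rel and (c, b) in rel for c in nodes)}


def path_stats(nodes, edges):
    """Return (L_max, N_max): longest maximal path (in edges) and number of maximal paths."""
    cover = transitive_reduction(nodes, transitive_closure(edges))
    succ = {n: [b for (a, b) in cover if a == n] for n in nodes}
    has_pred = {b for (_, b) in cover}

    @lru_cache(maxsize=None)
    def longest(n):
        # Number of edges on the longest path starting at n.
        return max((1 + longest(m) for m in succ[n]), default=0)

    @lru_cache(maxsize=None)
    def count(n):
        # Number of paths from n that end at a node with no successor.
        return sum(count(m) for m in succ[n]) if succ[n] else 1

    sources = [n for n in nodes if n not in has_pred]
    return max(longest(s) for s in sources), sum(count(s) for s in sources)


# The episode A -> ((F) (B -> (C D) -> E)) discussed in the text.
nodes = "ABCDEF"
edges = {("A", "F"), ("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")}
print(path_stats(nodes, edges))
```

Under these assumptions the example episode gives $L_{max} = 3$ and $N_{max} = 3$, an $N$-node parallel episode gives $(0, N)$, and a serial episode gives $N_{max} = 1$, matching the cases described in the text.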

VI. Discussion 

We wish to point out that the proposed counting and candidate generation algorithms for
injective episodes can be extended to a class of non-injective episodes, where nodes mapped
to the same event lie along a chain in the associated partial order. It is interesting to note that
the class of all non-injective serial episodes is contained in this special class of non-injective
partial order episodes. To keep the representation of such $l$-node episodes unambiguous, the
$g$-map is restricted (very similar to injective episodes) such that $g(v_1), g(v_2), \ldots, g(v_l)$ obey the
lexicographic (total) order on $\mathcal{E}$. For example, suppose we have a 5-node episode with 3 of the
nodes mapped to $A$ and the remaining 2 mapped to $B$. Then, $g(v_i)$ must be $A$ for $i = 1, 2, 3$
and $B$ for $i = 4, 5$. Further, since the episodes are such that the nodes mapped to the same event
lie along a chain, we impose a special restriction on $<_{\alpha}$ to avoid further ambiguity. Suppose
$v_i, v_{i+1}, \ldots, v_{i+m}$ are mapped to the same event-type $E$. There are $(m+1)!$ total orders possible
among these nodes, each of which would represent the same episode pattern. To avoid this
redundancy, we restrict $<_{\alpha}$ to be such that $v_i <_{\alpha} v_{i+1} <_{\alpha} \ldots <_{\alpha} v_{i+m}$.

Consider a non-injective episode $\alpha$ with $V_{\alpha} = \{v_1, v_2, v_3, v_4\}$ and $<_{\alpha} = \{(v_1, v_3), (v_2, v_4)\}$. $g_{\alpha}(\cdot)$
is such that $g_{\alpha}(v_1) = g_{\alpha}(v_2) = A$, $g_{\alpha}(v_3) = B$ and $g_{\alpha}(v_4) = C$. To track an occurrence of
such an episode, we would initially wait for 2 $A$s. Once we see an $A$, we could accept it as
the $A$ associated with either $v_1$ or $v_2$. Depending on what we choose, we would now wait for either
$\{A, B\}$ (if the accepted $A$ is associated with $v_1$) or $\{A, C\}$ (if the accepted $A$ is associated with $v_2$).
As per our current counting strategy, on seeing $A$ there is more than one next state possible
depending on the associated node in $V_{\alpha}$. Hence, a non-deterministic finite state automaton would
be the right computational device to track occurrences of $\alpha$. In general, a non-deterministic finite
state automaton (NFA) would be computationally more expensive compared to a deterministic
automaton. Note that $\alpha$ does not belong to the class of non-injective episodes that we are
considering, i.e. the nodes $v_1$ and $v_2$ are not related even though they map to the same event
$A$. The point is that to count episodes like $\alpha$, our strategy of counting leads to
automata which are non-deterministic in nature. Even though an NFA can be converted to an
equivalent DFA, the number of states of this equivalent DFA can be huge. Hence, in addition to
the problems with representation, counting is also not straightforward for episodes outside the
class of non-injective episodes considered here. We now argue how deterministic
finite state automata (DFA) can still be used to track occurrences of this class of non-injective
episodes, even though in general, for non-injective episodes, one requires NFAs or much larger DFAs
to track occurrences.

The DFA construction procedure for injective episodes can be generalized to this class of
non-injective episodes. Each state would again be a tuple $(Q^{\alpha}, W^{\alpha})$. $Q^{\alpha}$ here would be a
multiset (essentially a set having repeated elements). Interestingly, one can verify that $W^{\alpha}$ is
always a proper set for every state in this construction. Suppose not; then $W^{\alpha}$ would have
at least 2 repeated elements. All parents of their corresponding nodes are contained in the set
of nodes associated with $Q^{\alpha}$ (from the constructive definition). Note that the two sets of
nodes from $V_{\alpha}$ associated with $W^{\alpha}$ and $Q^{\alpha}$ are disjoint. This means the two nodes which map
to repeated elements in $W^{\alpha}$ are unrelated in $<_{\alpha}$. But as per the class of non-injective episodes
we are dealing with, two such nodes mapped to the same element must be related, which is a
contradiction. Hence, $W^{\alpha}$ is a proper set. Therefore, each transition from a given state would
be on seeing a unique event type. This ensures that the finite state automaton so constructed is
deterministic. Hence the counting algorithms proposed for injective episodes go through almost
exactly for this class of non-injective episodes.
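
The determinism argument can be checked on small examples with the Python sketch below. It works directly on nodes rather than on the $(Q^{\alpha}, W^{\alpha})$ tuples of event types, and the helper names `waiting_nodes` and `is_deterministic` are hypothetical; it is a sanity check under these assumptions rather than the paper's construction.

```python
def waiting_nodes(nodes, order, seen):
    """Nodes not yet seen whose parents (under the partial order) are all seen."""
    return [v for v in nodes
            if v not in seen and all(u in seen for (u, w) in order if w == v)]


def is_deterministic(nodes, order, g):
    """True if, in every reachable state, the waiting nodes have distinct event types.

    nodes: list of node names; order: transitively closed set of (earlier, later)
    pairs; g: dict mapping each node to its event type.
    """
    frontier, visited = [frozenset()], {frozenset()}
    while frontier:
        seen = frontier.pop()
        waiting = waiting_nodes(nodes, order, seen)
        types = [g[v] for v in waiting]
        if len(types) != len(set(types)):
            return False  # two waiting nodes share an event type
        for v in waiting:
            nxt = seen | {v}
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(nxt)
    return True


nodes = ["v1", "v2", "v3", "v4"]
g = {"v1": "A", "v2": "A", "v3": "B", "v4": "C"}

# The example from the text: v1 and v2 both map to A but are unrelated.
unrelated = {("v1", "v3"), ("v2", "v4")}
# A variant in which the two A nodes lie on a chain (transitively closed).
chained = {("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v1", "v4")}

print(is_deterministic(nodes, unrelated, g), is_deterministic(nodes, chained, g))
```

In the first relation both $A$ nodes wait simultaneously in the start state, so the check fails; in the chained variant every waiting set has distinct event types.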

We now elaborate on the candidate generation. We combine episodes $\alpha = (\{v_1, v_2, \ldots, v_l\}, <_{\alpha}, g_{\alpha})$
and $\beta = (\{v_1, v_2, \ldots, v_l\}, <_{\beta}, g_{\beta})$ if (i) $g_{\alpha}(v_i) = g_{\beta}(v_i)$ $\forall i = 1, \ldots, (l-1)$, (ii) $<_{\alpha}|_{\{v_1, \ldots, v_{l-1}\}}$
$= <_{\beta}|_{\{v_1, \ldots, v_{l-1}\}}$, and (iii) $g_{\alpha}(v_l) = g_{\beta}(v_l)$ OR $g_{\alpha}(v_l)$ precedes $g_{\beta}(v_l)$ as per the lexicographic
ordering on $\mathcal{E}$. Let $V_{\gamma} = \{v_1, \ldots, v_{l+1}\}$. $<_{\gamma}$ is a relation on $V_{\gamma}$ defined as follows: $v_i <_{\gamma} v_j$
iff $v_i <_{\alpha} v_j$ for $i, j = 1, \ldots, l$. Also, $\forall i = 1, 2, \ldots, (l-1)$, we have $v_i <_{\gamma} v_{l+1}$ iff $v_i <_{\beta} v_l$ and
$v_{l+1} <_{\gamma} v_i$ iff $v_l <_{\beta} v_i$. $g_{\gamma}$, a map from $V_{\gamma}$ to $\mathcal{E}$, is such that $g_{\gamma}(v_i) = g_{\alpha}(v_i)$ $\forall i = 1, \ldots, l$ and
$g_{\gamma}(v_{l+1}) = g_{\beta}(v_l)$.

The way we combine $\alpha$ and $\beta$ varies slightly with the two subconditions of (iii). Suppose
$g_{\alpha}(v_l)$ precedes $g_{\beta}(v_l)$ as per the lexicographic ordering on $\mathcal{E}$. Consider the following relations:
$<_{\gamma_0} = <_{\gamma}$, $<_{\gamma_1} = <_{\gamma} \cup \{(v_l, v_{l+1})\}$, $<_{\gamma_2} = <_{\gamma} \cup \{(v_{l+1}, v_l)\}$. An episode $(V_{\gamma}, <_{\gamma_i}, g_{\gamma})$ is generated iff $<_{\gamma_i}$
is a partial order. Note that this is exactly what we have already been doing for injective
episodes. The additional thing that needs to be done for this special class of non-injective episodes
is as follows. Suppose $g_{\alpha}(v_l) = g_{\beta}(v_l)$; then we only ask if $<_{\gamma_1}$ is a partial order. This is because
of the following reason. We have $g_{\gamma}(v_l) = g_{\alpha}(v_l) = g_{\beta}(v_l) = g_{\gamma}(v_{l+1})$. Hence, $g_{\gamma}$ maps $v_l$ and
$v_{l+1}$ to the same event type. Recall that the unambiguous representation (for the special class of
non-injective episodes) demands that $v_l <_{\gamma} v_{l+1}$. Hence the only permissible candidate would
be $(V_{\gamma}, <_{\gamma_1}, g_{\gamma})$. So we generate this as a candidate if and only if $<_{\gamma_1}$ is a partial order.

VII. Selection of Interesting Partial Order Episodes 

The frequent episode mining method would ultimately output all frequent episodes up to
some size. However, as we see in this section, frequency alone is not a sufficient indicator of
interestingness in the case of episodes with general partial orders.

Consider an $l$-node episode, $\alpha = (X^{\alpha}, R^{\alpha})$. (That is, $|X^{\alpha}| = l$.) If $\alpha$ is frequent then all
episodes $\alpha' = (X^{\alpha'}, R^{\alpha'})$ with $X^{\alpha'} = X^{\alpha}$ and $R^{\alpha'} \subseteq R^{\alpha}$ would also be frequent $l$-node episodes
because every occurrence of $\alpha$ would constitute an occurrence of $\alpha'$. The point to note is that
when we consider episodes with general partial orders, an episode of size $l$ can have subepisodes
which are also of size $l$. Such a situation does not arise if the mining process is restricted to either
serial or parallel episodes only. For example, there is no other 4-node serial episode that is a subepisode
of $A \rightarrow B \rightarrow C \rightarrow D$. However, when considering general partial orders, given $\alpha = (X^{\alpha}, R^{\alpha})$
there can be, in general, exponentially many episodes $\alpha' = (X^{\alpha'}, R^{\alpha'})$ with $X^{\alpha'} = X^{\alpha}$ and
$R^{\alpha'} \subseteq R^{\alpha}$. For example, $(A\,(B \rightarrow C \rightarrow D))$, $(B\,(A \rightarrow C \rightarrow D))$, $(C\,(A \rightarrow B \rightarrow D))$,
$(D\,(A \rightarrow B \rightarrow C))$, $((A\,B)\,(C \rightarrow D))$, $(A\,B) \rightarrow C \rightarrow D$, $(A\,B\,C) \rightarrow D$, $A \rightarrow (B\,C) \rightarrow D$ etc. are
all such subepisodes of $A \rightarrow B \rightarrow C \rightarrow D$. Thus, there is an inherent combinatorial explosion
in frequent episodes of a given size when we are considering general partial orders and, hence,
frequency alone may not be a sufficient indicator of 'interestingness'. In this section, we propose
a new measure, called bidirectional evidence of an episode, which can be used in conjunction
with frequency of an episode to make the mining process more efficient and meaningful.
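
To give a feel for the size of this explosion, the short Python sketch below enumerates every partial order on the same event types whose relation is contained in that of a serial episode. The function name `same_size_subepisodes` and the edge-set representation are assumptions made for illustration.

```python
from itertools import combinations


def is_transitive(rel):
    """True if (a, b) and (b, d) in rel always implies (a, d) in rel."""
    return all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)


def same_size_subepisodes(chain):
    """All transitively closed sub-relations of the total order given by `chain`.

    Each result is a frozenset of (earlier, later) pairs on the same event
    types, i.e. an episode with the same nodes and a relation contained in
    that of the serial episode (the serial episode itself is included).
    """
    full = [(chain[i], chain[j])
            for i in range(len(chain)) for j in range(i + 1, len(chain))]
    result = []
    for k in range(len(full) + 1):
        for subset in combinations(full, k):
            rel = frozenset(subset)
            if is_transitive(rel):
                result.append(rel)
    return result


subs = same_size_subepisodes("ABCD")
print(len(subs))  # prints 40
```

For $A \rightarrow B \rightarrow C \rightarrow D$ the count is 40 (including the serial episode itself and the parallel episode), and it grows rapidly with the number of nodes.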

A. Bidirectional evidence

A simple-minded strategy to tackle the explosion of frequent episodes could be to use a notion
similar to that of maximal frequent patterns that has been used in other data mining contexts such
as item sets or sequential patterns.

Definition 7: An $l$-node episode $\alpha' = (X^{\alpha'}, R^{\alpha'})$ is said to be less specific than an $l$-node episode
$\alpha = (X^{\alpha}, R^{\alpha})$ if $X^{\alpha'} = X^{\alpha}$ and $R^{\alpha'} \subseteq R^{\alpha}$. Given a set of $l$-node episodes, an episode is a
most specific episode if it is not less specific than any other episode in the set. (Note that, in
general, there can be many most specific episodes in a given set of episodes.)
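
Definition 7 translates into a simple filter. The Python sketch below is an illustration only; the pair representation (tuple of event types, frozenset of ordered pairs) and the function names are assumptions.

```python
def less_specific(ep_a, ep_b):
    """True if ep_a is less specific than ep_b: same event types, R_a contained in R_b."""
    (xa, ra), (xb, rb) = ep_a, ep_b
    return xa == xb and ra <= rb


def most_specific(episodes):
    """Episodes that are not less specific than any other episode in the collection."""
    return [e for e in episodes
            if not any(o != e and less_specific(e, o) for o in episodes)]


x = ("A", "B", "C")
serial_abc = (x, frozenset({("A", "B"), ("A", "C"), ("B", "C")}))  # A -> B -> C
serial_bac = (x, frozenset({("B", "A"), ("B", "C"), ("A", "C")}))  # B -> A -> C
ab_then_c = (x, frozenset({("A", "C"), ("B", "C")}))               # (A B) -> C
parallel = (x, frozenset())                                        # (A B C)

kept = most_specific([serial_abc, serial_bac, ab_then_c, parallel])
print(kept == [serial_abc, serial_bac])
```

When both serial episodes are in the collection, the filter keeps only them and drops both $(A\,B) \rightarrow C$ and $(A\,B\,C)$; when neither serial episode is present, $(A\,B) \rightarrow C$ survives.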

Now, after the mining process (that is, after finding all frequent episodes of size $l$, for a
given $l$), we can output only the most specific episodes of the set of frequent episodes. This
prunes out many partial orders (episodes) which are presumed uninteresting because a more
specific partial order (episode) is frequent and interesting. This specificity-based filter is not
wholly satisfactory, though it reduces the number of frequent episodes (of a given size) that
are output. Suppose the data actually contains the partial order (episode) $(A\,B) \rightarrow C$. Suppose
there are 200 occurrences of this episode, of which 110 are occurrences of $A \rightarrow B \rightarrow C$ while
90 are those of $B \rightarrow A \rightarrow C$. Depending on the frequency threshold, suppose one or both of
these serial episodes are also frequent. The parallel episode $(A\,B\,C)$, being less specific, would
also be frequent (and would have a frequency greater than 200). The specificity-based filter
would always suppress the parallel episode $(A\,B\,C)$ and, importantly, also suppress the episode
$(A\,B) \rightarrow C$ in preference to any of the serial episodes whenever they are frequent. Thus what we
output depends very critically on the frequency threshold. In addition, it is also not satisfactory
that whether or not we suppress $(A\,B) \rightarrow C$ depends only on the counts of these episodes.
Instead, we can ask whether there is any evidence in the data to decide which of these partial orders is a
better fit. If the data indeed contains only the partial order $(A\,B) \rightarrow C$ then it would be the case
that in most of the occurrences of the parallel episode $(A\,B\,C)$, $C$ follows both $A$ and $B$. We
would also see that in occurrences of $(A\,B) \rightarrow C$, $A$ follows $B$ roughly as often as it precedes $B$.
Now the fact that we have seen $A$ following $B$ roughly as often as $A$ preceding $B$, and that we
have rarely seen $C$ not following both $A$ and $B$, should mean that the partial order $(A\,B) \rightarrow C$
is a better representation of the dependencies in the data as compared to the serial episodes or the
parallel episode. Thus, in addition to frequency, it would be nice to evaluate the interestingness of
partial orders based on whether there is evidence in the data for not constraining the order of
occurrence of some pairs of event types. That is, we can demand that in the occurrences of the
episode (as counted by the algorithm) any two event types, $i, j \in X^{\alpha}$, such that $i$ and $j$ are not
related under $R^{\alpha}$ should occur in either order 'sufficiently often'. We will now formalize this
notion.

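Given an episode $\alpha$, let $\mathcal{G}^{\alpha} = \{(i, j) : i, j \in X^{\alpha}, i \neq j, (i, j), (j, i) \notin R^{\alpha}\}$. Let $f^{\alpha}$ denote the
number of occurrences (i.e., frequency) of $\alpha$ counted by our algorithm and let $f^{\alpha}_{ij}$ denote the
number of these occurrences where $i$ precedes $j$. Let $p^{\alpha}_{ij} = f^{\alpha}_{ij} / f^{\alpha}$. To rate the interestingness
of the partial order episode $\alpha$ we define a measure that tries to capture the relative magnitudes
of $p^{\alpha}_{ij}$ and $p^{\alpha}_{ji}$. Let

$$H^{\alpha}_{ij} = -p^{\alpha}_{ij} \log(p^{\alpha}_{ij}) - (1 - p^{\alpha}_{ij}) \log(1 - p^{\alpha}_{ij}) \qquad (10)$$

Since, in each occurrence, either $i$ precedes $j$ or $j$ precedes $i$, we have $p^{\alpha}_{ji} = 1 - p^{\alpha}_{ij}$ and hence
$H^{\alpha}_{ij}$ is symmetric in $i, j$. Note that $H^{\alpha}_{ij}$ is the entropy of the distribution $[p^{\alpha}_{ij}, (1 - p^{\alpha}_{ij})]$. We refrain
from using the term entropy for $H^{\alpha}_{ij}$, as $p^{\alpha}_{ij} = f^{\alpha}_{ij} / f^{\alpha}$ is tied to the specific subset of occurrences
counted by our algorithm.

The bidirectional evidence of an episode $\alpha$, denoted by $H(\alpha)$, is defined as follows.

$$H(\alpha) = \min_{(i,j) \in \mathcal{G}^{\alpha}} H^{\alpha}_{ij} \qquad (11)$$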

We use $H(\alpha)$ as an additional interestingness measure for $\alpha$. Essentially, if $H(\alpha)$ is above
some threshold, then there is sufficient evidence that all pairs of event types in $\alpha$ that are not
constrained by the partial order $R^{\alpha}$ appear in either order sufficiently often. We say that an
episode $\alpha$ is interesting if (i) the frequency is above a threshold, and (ii) $H(\alpha)$ is above a
threshold.
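
Equations (10) and (11) can be evaluated directly from the pairwise counts, as in the Python sketch below. The logarithm base is not fixed in the text, so base 2 is an assumption here (it makes the evidence of an evenly split pair equal to 1). Returning infinity when no pair is unconstrained (a serial episode) is likewise only the usual convention for a minimum over an empty set, not something the text specifies. The function names are illustrative.

```python
import math


def pair_evidence(f_ij, f_total, base=2.0):
    """H_ij of equation (10) from the count of occurrences in which i precedes j."""
    p = f_ij / f_total
    if p <= 0.0 or p >= 1.0:
        return 0.0  # limit of -p*log(p) as p tends to 0
    return -(p * math.log(p, base) + (1 - p) * math.log(1 - p, base))


def bidirectional_evidence(events, rel, f_ij, f_total, base=2.0):
    """H(alpha) of equation (11): minimum of H_ij over pairs unrelated under rel.

    events: iterable of event types; rel: set of (earlier, later) pairs;
    f_ij: dict mapping (i, j) to the number of occurrences where i precedes j.
    """
    unrelated = [(i, j) for i in events for j in events
                 if i < j and (i, j) not in rel and (j, i) not in rel]
    if not unrelated:
        return math.inf  # assumption: no unconstrained pair, nothing to test
    return min(pair_evidence(f_ij.get((i, j), 0), f_total, base)
               for (i, j) in unrelated)


# The (A B) -> C example: 200 occurrences, A precedes B in 110 of them.
rel = {("A", "C"), ("B", "C")}
h = bidirectional_evidence("ABC", rel, {("A", "B"): 110, ("B", "A"): 90}, 200)
print(round(h, 3))
```

Because $H^{\alpha}_{ij}$ is symmetric, the sketch only visits each unordered pair once. For the 110/90 split the evidence is close to its maximum, whereas a pair that almost always occurs in one order gives a value near zero.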

We now explain how $H(\alpha)$ can be computed during our frequency counting process. For
each episode, we maintain an $l \times l$ matrix $\alpha.H$ whose $(i, j)$th element would contain $f^{\alpha}_{ij}$ by the
end of counting. For each candidate episode $\alpha$, the matrix $\alpha.H$ is initialized to zero just before
counting. For each automaton that is initialized, we initialize a separate $l \times l$ matrix of zeros
stored with the automaton. Whenever an automaton makes a state transition on an event-type $j$,
for all $i$ such that event-type $i$ is already seen, we increment the $(i, j)$th entry in this matrix. The
matrix associated with an automaton that reaches its final state is added to $\alpha.H$ and results in the
increment of the relevant $f^{\alpha}_{ij}$ entries. Thus, at the end of the counting, $\alpha.H$ gives the $f^{\alpha}_{ij}$ information.
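
The bookkeeping described above can be sketched in Python as follows. For brevity the sketch uses dictionaries keyed by event-type pairs instead of $l \times l$ matrices and leaves out the automaton itself; the class name `OccurrenceTracker` and the function `add_completed` are hypothetical.

```python
from collections import defaultdict


class OccurrenceTracker:
    """Per-automaton record of which event types were seen before which."""

    def __init__(self):
        self.seen = []                 # event types accepted so far, in order
        self.pairs = defaultdict(int)  # (i, j) -> 1 if i was seen before j

    def accept(self, event_type):
        """Called when the automaton makes a state transition on event_type."""
        for earlier in self.seen:
            self.pairs[(earlier, event_type)] += 1
        self.seen.append(event_type)


def add_completed(episode_counts, tracker):
    """Add the counts of an automaton that reached its final state to alpha.H."""
    for pair, count in tracker.pairs.items():
        episode_counts[pair] += count


# Three completed occurrences of (A B) -> C, listed in the order events were accepted.
alpha_h = defaultdict(int)
for occurrence in ["ABC", "BAC", "ABC"]:
    tracker = OccurrenceTracker()
    for event_type in occurrence:
        tracker.accept(event_type)
    add_completed(alpha_h, tracker)

print(alpha_h[("A", "B")], alpha_h[("B", "A")])  # prints 2 1
```

Counts from automata that never reach their final state are simply discarded, so the accumulated entries describe only the occurrences that were actually counted.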

B. Mining with an additional $H(\alpha)$ threshold

One can use $H(\alpha)$ as a postprocessing filter. That is, after the mining process we only output
those $\alpha$ (of a given size) where $H(\alpha)$ is above a threshold. While this may reduce the number
of frequent episodes output, it will not make the mining process efficient. A better way would
be to use a threshold on $H(\alpha)$ at each size (or level) in our apriori-style level-wise counting
procedure. This can substantially contribute towards the efficiency of mining for general partial
orders. However, unlike in the case of the frequency threshold, it is not quite clear whether $H(\alpha)$ also
possesses the so-called anti-monotonicity property. The main difficulty is that $H(\alpha)$ is tied to a
specific set of occurrences counted by the algorithm. However, if an episode $\alpha$ has a bidirectional
evidence $H(\alpha) = e$ in a given set of occurrences, then one can see that any maximal subepisode
of $\alpha$ (obtained by the restriction of $R^{\alpha}$ onto a subset of $X^{\alpha}$) also has a bidirectional evidence
of at least $e$ in the same set of occurrences. Hence, at least in cases where the embedded pattern's
subepisodes most often occur with the embedded pattern, the bidirectional evidence of all its
maximal subepisodes will be at least that of the embedded pattern. Since our candidate generation
is based on the existence of all maximal subepisodes at the lower levels, the embedded pattern
$\beta$ most often comes up after mining, in the simulations. Further, the bidirectional evidence of all
the non-maximal subepisodes of the embedded pattern will be very low (almost zero). This is
because of the following. Any non-maximal subepisode $\gamma$ will lack some edge $(i, j)$ present
in the embedded pattern, in spite of the nodes $i$ and $j$ being present in $\gamma$. If most occurrences of
$\gamma$ are also those of $\beta$, $i$ precedes $j$ in almost all occurrences of $\gamma$ and hence $H(\gamma)$ is negligible.
Hence almost all non-maximal subepisodes of $\beta$ will have negligible bidirectional evidence
in spite of being frequent. Therefore, we weed out almost all the non-maximal subepisodes of
$\beta$ because the bidirectional evidence threshold is incorporated levelwise. These non-maximal subepisodes, if
not weeded out, would otherwise contribute to the generation of many more patterns at various
levels. In particular, they would result in the generation of all the less specific patterns of the
embedded pattern, as pointed out in the beginning of this section, which does not happen now. We
show through simulation that mining with an $H(\alpha)$ threshold at each level is indeed very effective.

VIII. Simulation Results

A. Synthetic Data Generation 

Synthetic data is generated by embedding occurrences of partial orders (episodes) in varying
levels of noise. Input to the data generator is a set of episodes that we want to embed in the
data. For each episode to be embedded, we generate an episode event stream containing just
non-overlapped occurrences of the partial-order episode. We next generate a separate noise stream
involving all event-types. We merge the various episode streams and the noise stream (that
is, string together all events in all the streams in a time-ordered fashion) to generate the final
data stream consisting of $T$ time ticks. The data generation process has three user-specified
parameters, $\eta$, $p$ and $\rho$, whose roles are explained below.

Each of the episode data streams is generated as follows. To embed each occurrence of an
episode, we choose, at random, one of its serial extensions$^{6}$ and then generate the occurrence
by having a sequence of event types as needed, with the difference in times of occurrence of
successive events being geometric with parameter $\eta$. The time between successive occurrences
of the episode is geometric with parameter $p$.

We generate the noise stream as follows. Let $\mathcal{E}_1$ denote the set of event types that appear in
any of the embedded episodes. Any event-type not in $\mathcal{E}_1$ is referred to as a noise event-type. For
each noise event-type, we generate a stream of just its occurrences, with the time between successive
events geometric with parameter $\rho$. Similarly, for each event-type in $\mathcal{E}_1$, we generate a stream
of just its occurrences, with the time between successive events geometric with parameter $\rho/5$. This
is done to introduce some random occurrences of the event-types associated with the embedded
partial orders. All these streams are merged to form a single noise stream. The noise stream is
generated in this way so that there may be multiple events (constituting noise) at the same time
instant. The noise data stream is merged with all the episode data streams to obtain the final
data stream.
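
A generator along these lines might look like the Python sketch below. Several details are assumptions rather than the authors' implementation: geometric variables are taken to have support 1, 2, ..., a serial extension is built by repeatedly picking a random available node (which need not be uniform over all serial extensions), and all function names and the mapping of the three parameters to `eta`, `p` and `rho` are illustrative.

```python
import random


def geometric(rng, prob):
    """Trials up to and including the first success (support 1, 2, ...)."""
    k = 1
    while rng.random() >= prob:
        k += 1
    return k


def random_serial_extension(rng, nodes, rel):
    """A total order on nodes consistent with the partial order rel."""
    remaining, order = set(nodes), []
    while remaining:
        # Nodes all of whose predecessors have already been placed.
        ready = sorted(v for v in remaining
                       if all(u not in remaining for (u, w) in rel if w == v))
        pick = rng.choice(ready)
        order.append(pick)
        remaining.remove(pick)
    return order


def episode_stream(rng, nodes, rel, eta, p, horizon):
    """Non-overlapped occurrences of one episode as (time, event_type) pairs."""
    events, t = [], 0
    while True:
        t += geometric(rng, p)  # gap before the next occurrence
        for k, e in enumerate(random_serial_extension(rng, nodes, rel)):
            if k > 0:
                t += geometric(rng, eta)  # gap between successive events
            if t > horizon:
                return events
            events.append((t, e))


def noise_stream(rng, event_type, rate, horizon):
    """Independent occurrences of a single event type."""
    events, t = [], 0
    while True:
        t += geometric(rng, rate)
        if t > horizon:
            return events
        events.append((t, event_type))


def generate(rng, episodes, all_types, eta, p, rho, horizon):
    """Merge episode streams and per-type noise streams into one sorted stream."""
    pattern_types = {e for nodes, _ in episodes for e in nodes}
    stream = []
    for nodes, rel in episodes:
        stream += episode_stream(rng, nodes, rel, eta, p, horizon)
    for e in all_types:
        rate = rho / 5 if e in pattern_types else rho
        stream += noise_stream(rng, e, rate, horizon)
    return sorted(stream)


rng = random.Random(0)
episode = (("A", "B", "C"), {("A", "C"), ("B", "C")})  # (A B) -> C
data = generate(rng, [episode], ["A", "B", "C", "X", "Y"],
                eta=0.7, p=0.055, rho=0.068, horizon=500)
```

Because each noise event type has its own independent stream, several noise events can share a time tick, as the text intends. An occurrence that runs past the horizon is truncated in this sketch.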

B. Effectiveness of Partial Order Mining 

We first show that our algorithm is effective in unearthing the embedded partial orders in the
data stream and also that our new measure of interestingness, namely, bidirectional evidence, is
very useful in improving the efficiency of the mining process.

$^{6}$ A serial extension of a partially ordered set $(X^{\alpha}, R^{\alpha})$ is a totally ordered set $(X^{\alpha}, R')$ such that $R^{\alpha} \subseteq R'$.

TABLE II

Frequent episode output of the algorithm with and without bidirectional evidence. (Patterns: $\alpha_1$ and
$\alpha_2$, $\eta = 0.7$, $p = 0.055$, $\rho = 0.068$, $M = 60$, $T = 10000$, $f_{th} = 350$, $T_X = 15$, $H_{th} = 0.4$)

| Level | $f_{th}$ only: #Cand | $f_{th}$ only: #Freq | $H_{th}$ for post-filtering: #Cand | $H_{th}$ for post-filtering: #Freq | $f_{th}$ and $H_{th}$: #Cand | $f_{th}$ and $H_{th}$: #Freq |
|-------|------|------|------|------|------|------|
| 1     | 60   | 60   | 60   | 60   | 60   | 60   |
| 2     | 5310 | 565  | 5310 | 565  | 5310 | 565  |
| 3     | 3810 | 435  | 3810 | 331  | 3810 | 331  |
| 4     | 1358 | 760  | 1358 | 129  | 623  | 125  |
| 5     | 1861 | 1855 | 1861 | 37   | 36   | 32   |
| 6     | 2993 | 2993 | 2993 | 6    | 6    | 6    |
| 7     |      |      |      |      |      |      |

Run time: 134s ($f_{th}$ only), 142s ($H_{th}$ for post-filtering), 52s ($f_{th}$ and $H_{th}$).

We generated a data stream of about 50,000 events (from a set of 60 event types) with 10,000
time ticks, in which are embedded the partial orders $\alpha_1 = (A \rightarrow (B\,C) \rightarrow (D\,E) \rightarrow F)$ and
$\alpha_2 = (G \rightarrow ((H \rightarrow (J\,K))\,(I \rightarrow L)))$, both of which are 6-node episodes. Table II shows the
results obtained with our mining algorithm. We show the number of candidates (#Cand) and the
number of frequent episodes (#Freq) at different levels. (Recall that at level $k$, the algorithm finds
all frequent episodes of size $k$.) The table shows the results for the cases: (i) when we only use a
threshold on frequency ($f_{th}$ only), (ii) when we use a threshold on frequency for mining but use
a threshold on $H(\alpha)$ as a post processing filter at each level ($H_{th}$ for post-filtering) and (iii)
when we use a threshold on frequency as well as on $H(\alpha)$ at each level ($f_{th}$ and $H_{th}$). (Other
parameters such as noise levels, thresholds, expiry time etc. are given in the table caption.)

The two embedded patterns are reported as frequent in all three cases. However, with
only a frequency threshold, many uninteresting patterns (such as the subepisodes of the embedded
patterns) are also reported as frequent. When we use an $H(\alpha)$-threshold-based post-processing
filter (case (ii)), the number of candidates naturally remains the same, but the number of frequent
episodes output comes down drastically, as can be seen from the tables. However, the run-time actually
increases marginally because of the overhead of calculating $H(\alpha)$. When we use a threshold on
both frequency and $H(\alpha)$, the efficiency improves considerably, as can be seen from the
reduction in the number of candidates as well as run-time. As can be seen from the tables, whether we
use the threshold on $H(\alpha)$ only for post-processing the outputs or also for reducing the candidates
at each level, we get essentially the same output at all levels; at level 6, the two embedded
patterns along with some superepisodes are the only ones output. We note that even when we
use thresholds on both $H(\alpha)$ and frequency, we simply refer to the output as 'frequent
episodes.'

TABLE III

Details of frequent episodes obtained when we use only a frequency threshold. (Patterns: $\alpha_1$ and $\alpha_2$,
$\eta = 0.7$, $p = 0.055$, $p = 0.068$, $M = 60$, $T = 10000$, $f_{th} = 350$, $T_X = 15$.)

      |       |       | Subepisodes                                       | Non-subepisodes
Level | #Cand | #Freq | #Max       | #Max       | #Non-max   | #Non-max   | #Noise | #Mix | #Super     | #Super     | #Others    | #Others
      |       |       | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$ |        |      | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$
------+-------+-------+------------+------------+------------+------------+--------+------+------------+------------+------------+-----------
1     | 60    | 60    | 6          | 6          | 0          | 0          | 48     | 0    | 0          | 0          | 0          | 0
2     | 5310  | 565   | 15         | 15         | 8          | 13         | 474    | 36   | 4          | 0          | 0          | 0
3     | 3810  | 435   | 20         | 20         | 49         | 96         | 10     | 214  | 13         | 0          | 13         | 0
4     | 1358  | 760   | 15         | 15         | 142        | 411        | 0      | 52   | 19         | 0          | 106        | 0
5     | 1861  | 1855  | 6          | 6          | 228        | 1268       | 0      | 0    | 12         | 0          | 335        | 0
6     | 2993  | 2993  | 1          | 1          | 174        | 2385       | 0      | 0    | 3          | 0          | 429        | 0

TABLE IV

Details of frequent episodes obtained when we use bidirectional evidence as a post-filter. (Patterns: $\alpha_1$
and $\alpha_2$, $\eta = 0.7$, $p = 0.055$, $p = 0.068$, $M = 60$, $T = 10000$, $f_{th} = 350$, $T_X = 15$, $H_{th} = 0.4$.)

      |       |       | Subepisodes                                       | Non-subepisodes
Level | #Cand | #Freq | #Max       | #Max       | #Non-max   | #Non-max   | #Noise | #Mix | #Super     | #Super     | #Others    | #Others
      |       |       | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$ |        |      | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$
------+-------+-------+------------+------------+------------+------------+--------+------+------------+------------+------------+-----------
1     | 60    | 60    | 6          | 6          | 0          | 0          | 48     | 0    | 0          | 0          | 0          | 0
2     | 5310  | 565   | 15         | 15         | 8          | 13         | 474    | 36   | 4          | 0          | 0          | 0
3     | 3810  | 331   | 20         | 20         | 27         | 23         | 10     | 214  | 13         | 0          | 4          | 0
4     | 1358  | 129   | 15         | 15         | 14         | 15         | 0      | 41   | 19         | 0          | 10         | 0
5     | 1861  | 37    | 6          | 6          | 1          | 6          | 0      | 0    | 12         | 0          | 6          | 0
6     | 2993  | 6     | 1          | 1          | 0          | 1          | 0      | 0    | 3          | 0          | 0          | 0
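The three thresholding regimes compared in Tables III to V differ only in where the $H(\alpha)$ threshold is applied. The following is a minimal sketch of that control flow, not the paper's implementation: episodes are modelled as plain symbol sets, `next_level` is a toy candidate generator, and `freq` and `H` are caller-supplied functions; all of these names are assumptions made for illustration.

```python
from itertools import combinations

def next_level(seed, size):
    """Toy candidate generation (hypothetical): unions of two seeds that have
    `size` symbols and whose every (size-1)-subset is itself a seed."""
    seed_set = set(seed)
    unions = {a | b for a, b in combinations(seed, 2) if len(a | b) == size}
    keep = [u for u in unions if all(u - {x} in seed_set for x in u)]
    return sorted(keep, key=sorted)

def mine(symbols, freq, H, f_th, H_th, mode):
    """Level-wise mining sketch.

    mode = "freq_only"  : case (i), frequency threshold only
    mode = "post_filter": case (ii), H threshold applied to the output only
    mode = "level_wise" : case (iii), H threshold also prunes the next level
    Returns {level: (number of candidates, reported episodes)}.
    """
    result = {}
    candidates = [frozenset([s]) for s in symbols]
    level = 1
    while candidates:
        frequent = [c for c in candidates if freq(c) >= f_th]
        if mode == "freq_only":
            reported = frequent
        else:
            reported = [c for c in frequent if H(c) >= H_th]
        result[level] = (len(candidates), reported)
        # Only the level-wise mode lets the H threshold shrink the seed set.
        seed = reported if mode == "level_wise" else frequent
        candidates = next_level(seed, level + 1)
        level += 1
    return result

# Toy data: every episode is frequent, but only {A, B} has high H at level 2.
H_TABLE = {frozenset("AB"): 0.9, frozenset("AC"): 0.1,
           frozenset("BC"): 0.1, frozenset("ABC"): 0.1}
toy_freq = lambda c: 10
toy_H = lambda c: H_TABLE.get(c, 1.0)
post = mine("ABC", toy_freq, toy_H, 5, 0.4, "post_filter")
lvl = mine("ABC", toy_freq, toy_H, 5, 0.4, "level_wise")
```

In this toy run both modes report the same episodes, but the post-filter still counts a level-3 candidate while the level-wise mode generates none, which mirrors the reduction in #Cand between Tables IV and V.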






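TABLE V

Details of frequent episodes obtained when we use the bidirectional evidence threshold at each level.
(Patterns: $\alpha_1$ and $\alpha_2$, $\eta = 0.7$, $p = 0.055$, $p = 0.068$, $M = 60$, $T = 10000$, $f_{th} = 350$, $T_X = 15$, $H_{th} = 0.4$.)

      |       |       | Subepisodes                                       | Non-subepisodes
Level | #Cand | #Freq | #Max       | #Max       | #Non-max   | #Non-max   | #Noise | #Mix | #Super     | #Super     | #Others    | #Others
      |       |       | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$ |        |      | $\alpha_1$ | $\alpha_2$ | $\alpha_1$ | $\alpha_2$
------+-------+-------+------------+------------+------------+------------+--------+------+------------+------------+------------+-----------
1     | 60    | 60    | 6          | 6          | 0          | 0          | 48     | 0    | 0          | 0          | 0          | 0
2     | 5310  | 565   | 15         | 15         | 8          | 13         | 474    | 36   | 4          | 0          | 0          | 0
3     | 3810  | 331   | 20         | 20         | 27         | 23         | 10     | 214  | 13         | 0          | 4          | 0
4     | 623   | 125   | 15         | 15         | 13         | 15         | 0      | 41   | 19         | 0          | 7          | 0
5     | 36    | 32    | 6          | 6          | 1          | 6          | 0      | 0    | 12         | 0          | 1          | 0
6     | 6     | 6     | 1          | 1          | 0          | 1          | 0      | 0    | 3          | 0          | 0          | 0
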
Columns #Cand and #Freq indicate the number of candidate and frequent episodes, respectively, obtained at
each level. The remaining columns of these tables describe the different kinds of
frequent episodes obtained at each level. The columns under the Subepisodes category give
the number of frequent subepisodes of the embedded patterns at each level. The columns under the
Non-subepisodes category describe the frequent episodes that are not subepisodes of
any of the embedded patterns. Column #Max gives the number of maximal subepisodes of
each embedded pattern at each level. Column #Non-max gives the number of non-maximal
subepisodes of each embedded pattern at each level. Any episode that has an associated
noise event-type (one not in $\mathcal{E}_1$, the set of all event-types associated with the embedded partial orders)
is referred to as a noise episode; $(A \rightarrow Z)$ is a noise episode, for example. The number of such
frequent noise episodes at each level is given in column #Noise. The number of episodes
all of whose associated event-types are contained in $\mathcal{E}_1$, and which necessarily involve event-types from
at least two of the embedded patterns, is tabulated in column #Mix. (The current event stream,
of course, is generated by embedding only two patterns.) An episode like $(A \rightarrow B \rightarrow G \rightarrow H)$
is a mixed episode. Consider an episode $\alpha = (X, R)$ under either the super or the others category
(columns #Super or #Others respectively). All event-types from $X$ necessarily come from one
of the embedded patterns (say $\alpha_1$). Consider the maximal subepisode $\alpha_1'$ of $\alpha_1$ obtained by
its restriction to $X$. If $\alpha$ is a super-episode of $\alpha_1'$, then it belongs to the super category. For
example, $(A \rightarrow B \rightarrow C)$ is a super-episode of the maximal subepisode $(A \rightarrow (B\,C))$ (of $\alpha_1$).
Similarly, $(H \rightarrow J \rightarrow K)$ is a super-episode of the maximal subepisode $(H \rightarrow (J\,K))$ (of $\alpha_2$).
If $\alpha$ is neither a super-episode nor a subepisode of $\alpha_1'$, then it belongs to the others category. For example,
consider the maximal subepisode $\alpha_1' = ((B\,C) \rightarrow D)$ (of $\alpha_1$); then $\alpha = (C \rightarrow (B\,D))$ would belong
to the others category. The #Freq column in Table IV and Table V indicates the number of episodes
which are both frequent and have a high enough $H(\alpha)$.
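The column definitions above can be read as a small decision procedure. The sketch below is one way to express it, assuming that episodes are injective, that each partial order is stored as a transitively closed set of ordered pairs, and that the embedded patterns have disjoint event-types. The two patterns in `EMBEDDED` are hypothetical stand-ins chosen only so that the examples quoted in the text can be reproduced; they are not the paper's $\alpha_1$ and $\alpha_2$.

```python
def restrict(edges, nodes):
    """Restriction of a partial order (set of ordered pairs) to `nodes`."""
    return {(u, v) for (u, v) in edges if u in nodes and v in nodes}

def classify(nodes, edges, embedded):
    """Return the table column that the episode (nodes, edges) falls under.

    `embedded` maps a pattern name to (node set, transitively closed edges).
    """
    nodes, edges = set(nodes), set(edges)
    e1 = set().union(*(x for x, _ in embedded.values()))  # all embedded event-types
    if not nodes.issubset(e1):
        return "noise"
    owners = [name for name, (x, _) in embedded.items() if nodes.issubset(x)]
    if not owners:
        return "mix"  # event-types drawn from more than one embedded pattern
    maximal = restrict(embedded[owners[0]][1], nodes)  # maximal subepisode on X
    if edges == maximal:
        return "max"
    if edges.issubset(maximal):
        return "non-max"
    if edges.issuperset(maximal):
        return "super"
    return "others"

# Hypothetical embedded patterns: A -> (B C) -> D and G -> H -> (J K).
EMBEDDED = {
    "a1": (set("ABCD"), {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}),
    "a2": (set("GHJK"), {("G", "H"), ("G", "J"), ("G", "K"), ("H", "J"), ("H", "K")}),
}

print(classify("ABC", {("A", "B"), ("B", "C"), ("A", "C")}, EMBEDDED))  # super
print(classify("BCD", {("C", "B"), ("C", "D")}, EMBEDDED))              # others
```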

From Table III we see that using only a threshold on frequency leads to a total of 2993
episodes of size 6 being reported as frequent. Of these, two are the embedded patterns (under
the maximal subepisodes category), 174 and 2385 are non-maximal subepisodes of $\alpha_1$ and $\alpha_2$
respectively, 3 are super-episodes of $\alpha_1$ and 429 are spurious episodes that do not contain any
'noise' event-type. The results in Table V show that when we use a threshold on both frequency
and $H(\alpha)$, only 6 episodes of size 6 are reported as frequent: the two embedded patterns, one
non-maximal subepisode of $\alpha_2$ and three super-episodes of $\alpha_1$. Thus, when we use only a threshold
on frequency, most of the episodes reported as frequent are non-maximal subepisodes, which
can never be eliminated based on their frequencies because they occur at least as frequently as the
embedded patterns. This is the inherent combinatorial explosion in partial order mining that we
pointed out in Sec. VII-A. Bidirectional evidence is effective in eliminating these and reporting
only the actual partial orders embedded in the data. This is because patterns grouped under the
Non-maximal subepisodes and Others categories have a pair of event-types that are
not related in these episodes but are related in one of the embedded patterns. Since most of the
occurrences of these episodes come from the embedded pattern, it is easy to verify from Eq. [77]
that almost all these patterns have a very low bidirectional evidence. This effect is seen at all
levels in the tables. From Tables IV and V we see that the frequent episodes output are
essentially the same whether we use a post-processing or a level-wise threshold on $H(\alpha)$. All
these results show that using a level-wise threshold on $H(\alpha)$ provides substantial improvement
in efficiency while not missing any important patterns in the set of frequent episodes output.
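As a quick cross-check of the level-6 figures quoted above, the category counts can be summed and compared with #Freq. The tuples below simply transcribe the level-6 rows of Tables III and V; the helper names are illustrative.

```python
# Level-6 rows: (#Freq, #Max a1, #Max a2, #Non-max a1, #Non-max a2,
#                #Noise, #Mix, #Super a1, #Super a2, #Others a1, #Others a2)
LEVEL6 = {
    "Table III": (2993, 1, 1, 174, 2385, 0, 0, 3, 0, 429, 0),
    "Table V": (6, 1, 1, 0, 1, 0, 0, 3, 0, 0, 0),
}

def category_total(row):
    """Sum of all category columns; should equal #Freq."""
    return sum(row[1:])

def nonmax_share(row):
    """Fraction of frequent episodes that are non-maximal subepisodes."""
    return (row[3] + row[4]) / row[0]

for name, row in LEVEL6.items():
    print(name, category_total(row) == row[0], round(nonmax_share(row), 3))
```

With a frequency threshold alone, roughly 85% of the size-6 output consists of non-maximal subepisodes, which is the explosion the paragraph describes.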

Also, the $H(\alpha)$-based threshold helps us in mining larger patterns. For example, when
this algorithm was run (with only a frequency threshold) on a data stream with an 8-node episode
embedded in it, even after a run-time of about 300 seconds, the algorithm was still counting
the candidates at level 7. This is mainly due to the inherent combinatorial explosion in partial
order mining. Most of the patterns reported in the non-max and others categories at lower levels
contribute to the generation of a huge number of uninteresting frequent patterns at higher levels,
in turn leading to a huge number of candidate patterns at higher levels. Hence, counting at level 7
was taking a lot of time. Mining with an $H(\alpha)$ threshold, we could discover the 8-node embedded
episode in a reasonable amount of time.

TABLE VI

Frequent episodes obtained when the algorithm is run in serial, parallel and general mode. (Patterns: 2
serial, 2 parallel, $\alpha_1$ and $\alpha_2$, $\eta = 0.7$, $p = 0.055$, $p = 0.068$, $M = 100$, $T = 10000$, $f_{th} = 375$, $T_X = 12$, $H_{th} = 0.4$.)

         | Serial mode   | Parallel mode | General mode
Level    | #Cand | #Freq | #Cand | #Freq | #Cand | #Freq
---------+-------+-------+-------+-------+-------+------
1        | 100   | 100   | 100   | 100   | 100   | 100
2        | 9900  | 54    | 4950  | 555   | 14850 | 609
3        | 58    | 58    | 4830  | 71    | 6422  | 225
4        | 34    | 34    | 37    | 33    | 184   | 156
5        | 12    | 12    | 12    | 12    | 60    | 60
6        | 2     | 2     | 2     | 2     | 10    | 8
Run time | 58 s          | 1 min 28 s    | 3 min 07 s

TABLE VII

Results obtained when mining with thresholds on $L_{max}$ and $N_{max}$.
$p = 0.045$, $p = 0.055$, $\eta = 0.7$, $M = 100$, $T = 10000$, $f_{th} = 300$, $H_{th} = 0.35$, $T_X = 15$.

$L_{max}$ | $N_{max}$ | #Satisfying (Fig. 5) | #freq | Run-time
----------+-----------+----------------------+-------+----------
0         | 10        | 1                    | 1     | 6 m 29 s
2         | 10        | 2                    | 3     | 9 m 48 s
5         | 4         | 3                    | 5     | 9 m 45 s
6         | 2         | 1                    | 2     | 2 m 57 s
7         | 1         | 1                    | 1     | 53 s
7         | 6         | 5                    | 10    | 9 m 55 s
7         | 18        | 8                    | 13    | 10 m 0 s
3         | 3         | 0                    | 0     | 9 m 27 s

C. Flexibility in candidate generation 

As described in Section IV-B, the same algorithm (with minor modifications in the candidate
generation) can be used to mine serial episodes, parallel episodes or any sub-class of partial
orders satisfying the maximal-subepisode property. To illustrate this, we generated a data stream
of about 50,000 events where, in addition to the episodes $\alpha_1$ and $\alpha_2$ defined in Sec. VIII-B,
we embedded two more serial episodes and two more parallel episodes. We ran our algorithm
on this data in the serial episode, parallel episode and general modes. When run in the
serial episode mode and the parallel episode mode, we recovered the two serial and the two
parallel episodes respectively. In the general mode, all six embedded partial orders (along with
two other episodes that were superepisodes of the embedded partial orders) were obtained.
Table VI shows these results.

TABLE VIII

Run-time as the noise level is increased by varying $p$. Patterns embedded: (iii) & (vi) from Fig. 5.
$p = 0.055$, $\eta = 0.7$, $M = 100$, $T = 10000$, $f_{th} = 300$, $T_X = 15$, $H_{th} = 0.35$.

$p$   | Noise level ($L_{ns}$) | Run-time
------+------------------------+----------
0.005 | 0.43                   | 3 s
0.02  | 0.75                   | 6 s
0.03  | 0.82                   | 30 s
0.045 | 0.87                   | 1 m 45 s
0.05  | 0.885                  | 6 m 1 s
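The level-2 candidate counts in Table VI follow directly from counting 2-node injective episodes over the $M = 100$ event-types, all of which are frequent at level 1. A short check, with variable names chosen for illustration:

```python
from math import comb, perm

M = 100  # number of event-types, all frequent at level 1 (Table VI)

serial_candidates = perm(M, 2)    # ordered pairs (A -> B)
parallel_candidates = comb(M, 2)  # unordered pairs (A B)
# A 2-node injective episode is either serial or parallel, so the general
# mode sees both families.
general_candidates = serial_candidates + parallel_candidates

print(serial_candidates, parallel_candidates, general_candidates)
# 9900 4950 14850
```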

Next, we generated synthetic data by embedding all 8 partial orders of Figure 5. Recall
that $L_{max}$ for a partial order is the length of its largest maximal path. Similarly, $N_{max}$ for a
partial order is the number of maximal paths in it. We present results obtained by mining
this data under different thresholds on $L_{max}$ and $N_{max}$ (Table VII). The column titled '#Satisfying
(Fig. 5)' refers to the number of partial orders in Figure 5 that satisfy the corresponding $L_{max}$
and $N_{max}$ constraints. We get all the embedded patterns that satisfy the $L_{max}$, $N_{max}$ constraints as
frequent episodes, along with a few extra episodes (as seen under #freq). From the table we see
that at lower thresholds on either $L_{max}$ or $N_{max}$, the algorithm runs faster. At higher thresholds,
the run-times are almost the same as those for mining all partial orders. This is because most
of the computational burden is due to the large number of candidates at levels 2 and 3, and the
candidates at these lower levels are not reduced if the bounds on $L_{max}$ and $N_{max}$ are high.

Fig. 5. Partial order episodes used for embedding in the data streams.

Fig. 6. Variation in the number of frequent episodes as a function of the frequency threshold. No. of embedded episodes $N_{emb} = 5$,
$p = 0.055$, $p = 0.068$, $\eta = 0.7$, $M = 100$, $T = 10000$, $H_{th} = 0.75$, $T_X = 15$.
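To make the $L_{max}$ and $N_{max}$ constraints concrete, the sketch below computes both for a small partial order. It rests on two assumptions that the text here does not spell out: a maximal path is taken to be a source-to-sink path along covering edges (the Hasse diagram), and its length is counted in edges, so that a parallel episode has $L_{max} = 0$. The function names are illustrative.

```python
def maximal_paths(nodes, edges):
    """Enumerate maximal paths of a partial order given as a transitively
    closed set of (u, v) pairs; paths follow covering (Hasse) edges."""
    edges = set(edges)
    # Covering edges: (u, v) with no w strictly between u and v.
    cover = {(u, v) for (u, v) in edges
             if not any((u, w) in edges and (w, v) in edges for w in nodes)}
    succ = {n: sorted(v for (u, v) in cover if u == n) for n in nodes}
    sources = sorted(n for n in nodes if not any(v == n for (_, v) in cover))
    paths = []

    def walk(path):
        following = succ[path[-1]]
        if not following:
            paths.append(path)  # reached a sink: the path is maximal
        for v in following:
            walk(path + [v])

    for s in sources:
        walk([s])
    return paths

def l_max(nodes, edges):
    """Length (in edges) of the longest maximal path."""
    return max(len(p) - 1 for p in maximal_paths(nodes, edges))

def n_max(nodes, edges):
    """Number of maximal paths."""
    return len(maximal_paths(nodes, edges))

# A -> (B C) -> D has two maximal paths, each with two edges.
diamond = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}
print(l_max("ABCD", diamond), n_max("ABCD", diamond))  # 2 2
```

Under this reading, a serial episode has $N_{max} = 1$ and a parallel episode on $k$ nodes has $N_{max} = k$, so bounding either quantity restricts mining to a sub-class of partial orders.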



D. Scaling and other properties of the algorithm 

Our mining algorithm is robust to the choice of frequency and $H(\alpha)$ thresholds, as illustrated
in Figures 6 & 7. Once the threshold is high enough to eliminate noisy/spurious episodes, the
number of episodes output is close to constant over a wide range of threshold choices.

The algorithm also scales well with the number of embedded patterns, data length and noise level.
In Tables X, IX & VIII, the data is generated from a set of 100 event-types, with different
8-node episodes from Fig. 5 embedded. The run-times given are average values obtained over 10
different runs.

Fig. 7. Variation in the number of frequent episodes as a function of the $H(\alpha)$ threshold. $f_{th} = 360$, rest same as the previous figure.

TABLE IX

Run-time as the data length is increased. $f_{th}/T = 0.03$, $p = 0.045$, rest same as Table VIII.

$T$   | Data length ($n$) | Run-time
------+-------------------+----------
5000  | 22,500            | 52 s
10000 | 45,000            | 1 m 45 s
15000 | 67,500            | 2 m 36 s
20000 | 90,000            | 3 m 25 s

Let $N_{ns}$ and $N_{sig}$ be the expected number of noise and signal events, respectively, in the data
stream under our simulation model. By noise events here, we refer to the events in the noise
stream. Similarly, by signal events, we refer to all the events coming from the various episode
streams. In a data stream with $N_{emb}$ embedded episodes (each of size $l$), one can verify that
$N_{ns} = (|\mathcal{E} - \mathcal{E}_1|\,p + |\mathcal{E}_1|\,p/5)\,T$, with $N_{sig}$ obtained analogously from the parameters of the
$N_{emb}$ episode streams. We define the noise level ($L_{ns}$) as the ratio of the expected number of
noise events to the expected total number of events as per our simulation model, i.e.
$L_{ns} = \frac{N_{ns}}{N_{ns} + N_{sig}}$. Table VIII describes the increase in run-times with the noise level $L_{ns}$.
We see that for low $p$ (say 0.02) the running time is very low. This is because, at the one-node level,
only the signal events are frequent, as a result of which the number of candidates at successive levels
is small. As $p$ increases, the number of candidates at the 2- and 3-node levels increases, and thus the
running times go up. For $p = 0.045$, the number of candidates at the 3-node level goes up to the order of 30,000,
because many 2-node episodes are frequent. Consequently, the running times are very high for
noise levels of about 0.88.

TABLE X

Run-time as the number of embedded patterns is increased. $p = 0.045$, rest same as Table VIII.

No. of patterns | Embedded patterns (Fig. 5) | Run-time
----------------+----------------------------+-----------
2               | (iii),(vi)                 | 1 m 45 s
5               | (i),(iii),(v),(vi),(viii)  | 4 m 18 s
8               | (i)-(viii)                 | 11 m 10 s
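The noise-level definition can be checked numerically against the tables. The sketch below evaluates the expression for the expected number of noise events and then uses the data length reported in Table IX for $T = 10000$ as the total event count, so that no expression for the signal events is needed. Treating the two embedded 8-node patterns as having 16 distinct event-types is an assumption, as are the function names.

```python
def expected_noise_events(num_types, num_signal_types, p, T):
    """N_ns = (|E - E_1| p + |E_1| p / 5) T."""
    noise_only = num_types - num_signal_types
    return (noise_only * p + num_signal_types * p / 5) * T

def noise_level(n_ns, n_sig):
    """L_ns: expected noise events over expected total events."""
    return n_ns / (n_ns + n_sig)

M, T, p = 100, 10000, 0.045
n_ns = expected_noise_events(M, 16, p, T)  # 16: two 8-node patterns, assumed disjoint
n_sig = 45000 - n_ns                       # total taken from Table IX (T = 10000)
print(round(noise_level(n_ns, n_sig), 2))  # 0.87, as listed for p = 0.045 in Table VIII
```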

Similarly, Table IX describes the run-time variations with data length. We observe that the
run-times increase almost linearly with data length. As the data length is increased, the ratio
$f_{th}/T$ is kept constant. Table X shows the run-time variations with the number of embedded
partial orders. We see an increase in the run-times because of the increased number of candidates
as the number of patterns is increased.
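The near-linear growth with data length can be read off Table IX by normalizing each run-time by $T$. A small check, with the run-times converted to seconds:

```python
# T -> run-time in seconds, transcribed from Table IX.
RUN_TIMES = {5000: 52, 10000: 105, 15000: 156, 20000: 205}

# Seconds spent per 1000 time ticks; a constant value means linear scaling.
per_1000_ticks = {T: secs / (T / 1000) for T, secs in RUN_TIMES.items()}
spread = max(per_1000_ticks.values()) - min(per_1000_ticks.values())
print(per_1000_ticks, round(spread, 2))
```

The cost stays between about 10.25 and 10.5 seconds per 1000 ticks across a fourfold increase in data length.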

IX. Conclusions 

In this paper we presented a method for discovering frequent episodes with general partial
orders. Episode discovery from event streams is a very useful data mining technique, though all
the currently available methods can discover only serial or parallel episodes. Here, we presented
a finite-automata-based algorithm for counting non-overlapped occurrences of injective episodes
with general partial orders. (Along the way, we noted some interesting properties of the finite
state automaton used to track the occurrences.) The method is efficient and can take care of
expiry-time constraints. The candidate generation algorithm presented here is very flexible and
can be used to focus the discovery process on many interesting subclasses of partial orders. In
particular, our method can be easily specialized to mine only serial or only parallel episodes.
Thus, the algorithm presented here can be used as a single method to discover serial episodes,
parallel episodes or episodes with general partial orders. Another important contribution of
this paper is a new measure of interestingness for partial order episodes, namely, bidirectional
evidence. We showed that there is an inherent combinatorial explosion in the number of frequent
episodes when one considers general partial orders. Bidirectional evidence is very useful in
discovering the most appropriate partial orders from the data stream. The effectiveness of the
data mining method is demonstrated through extensive simulations.

In this paper we have considered injective episodes and a special subclass of non-injective
partial order episodes (which includes all injective partial order episodes). We note that this
subclass includes the set of all non-injective serial episodes. In that sense, our algorithms truly
generalize existing serial and parallel episode discovery algorithms. Extending the ideas presented
here to the class of all partial order episodes is an important problem. Another potential direction
is the development of a statistical significance test for general partial order patterns in event
streams. We will address these in our future work.

X. Appendix

Finally, we describe the more efficient GetPotentialCandidates() function (listed as
Algorithm 3). The input to Algorithm 3 is a pair of episodes, $\alpha_1$ and $\alpha_2$, both of size $l$, and
both appearing in the same block of the set, $\mathcal{F}_l$, of frequent $l$-node episodes. Recall that $\alpha_1$
and $\alpha_2$ are identical in their first $(l - 1)$ nodes (in respect of both the associated event-types as
well as the partial order among these event-types). This common $(l - 1)$-size partially ordered
set is denoted as $X$. The output of Algorithm 3 is the set, $\mathcal{P}$, of potential candidates that can
be constructed from $\alpha_1$ and $\alpha_2$. The function GetPotentialCandidates() constructs a
$\mathcal{Y}_0$ combination of $\alpha_1$ and $\alpha_2$, as per the combination equation of the candidate generation
section, using the function SimpleJoin() and retains
only those combinations of $\alpha_1$ and $\alpha_2$ (of the three possible) which satisfy transitivity. The
GetPotentialCandidates() function decides the valid combinations based on some special
checks on the kind of nodes in $X$.

For purposes of easier illustration, we classify the nodes in $X$ based on their relation with $x_l^{\alpha_1}$
and $x_l^{\alpha_2}$. We have 9 such types of nodes. A node $z \in X$ is of one of the following types.

(1) - $(x_l^{\alpha_1}, z)$ and $(z, x_l^{\alpha_2})$ belong to $R$.
(1') - $(x_l^{\alpha_2}, z)$ and $(z, x_l^{\alpha_1})$ belong to $R$.

(2) - $(x_l^{\alpha_1}, z) \in R$, no edge between $z$ and $x_l^{\alpha_2}$.
(2') - $(z, x_l^{\alpha_1}) \in R$, no edge between $z$ and $x_l^{\alpha_2}$.

(3) - $(x_l^{\alpha_2}, z) \in R$, no edge between $z$ and $x_l^{\alpha_1}$.
(3') - $(z, x_l^{\alpha_2}) \in R$, no edge between $z$ and $x_l^{\alpha_1}$.
(4) - $(z, x_l^{\alpha_2})$ and $(z, x_l^{\alpha_1})$ belong to $R$.
(4') - $(x_l^{\alpha_1}, z)$ and $(x_l^{\alpha_2}, z)$ belong to $R$.
(4'') - neither connected to $x_l^{\alpha_1}$ nor $x_l^{\alpha_2}$.
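The nine node types can be read off the adjacency matrices of the two episodes. The sketch below assumes a 0-indexed matrix representation in which `e[u][v] == 1` means $(u, v)$ is in the partial order and the last row and column belong to each episode's own last node; the function name and the returned labels are illustrative.

```python
def node_type(e1, e2, i):
    """Type of the common node with index i, given the adjacency matrices
    e1 and e2 of the two l-node episodes being combined."""
    last = len(e1) - 1
    out1, in1 = e1[last][i], e1[i][last]  # (x1, z) and (z, x1)
    out2, in2 = e2[last][i], e2[i][last]  # (x2, z) and (z, x2)
    if out1 and in2:
        return "1"
    if out2 and in1:
        return "1'"
    if in1 and in2:
        return "4"
    if out1 and out2:
        return "4'"
    if out1:
        return "2"
    if in1:
        return "2'"
    if out2:
        return "3"
    if in2:
        return "3'"
    return "4''"

# Common node z (index 0); the first episode has x1 -> z, the second z -> x2.
print(node_type([[0, 0], [1, 0]], [[0, 1], [0, 0]], 0))  # 1
```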

We describe the GetPotentialCandidates() function with these nodes in mind. If a
node of type (1) exists (condition as per line 3), then $\mathcal{Y}_1$ is the only generated candidate (as per
lines 7, 9, 10). Similarly, if a node of type (1') exists (condition as per line 4), then $\mathcal{Y}_2$ is the only
generated candidate (as per lines 8-10). Suppose neither nodes of type (1) nor (1') exist;
then $\mathcal{Y}_0$ is a sure candidate. Further, $\mathcal{Y}_1$ is generated iff nodes of type (2') and (3) do not exist
in $X$. Similarly, $\mathcal{Y}_2$ is generated iff nodes of type (2) and (3') do not exist in $X$. One can verify
this from lines 14 onwards in GetPotentialCandidates(). To show its correctness, we
first make an important observation, stated as a lemma.

Lemma 1: In $\mathcal{Y}_0$, if a node of type (1) exists, there cannot exist nodes of type (1'), (2') and
(3). Similarly, if a node of type (1') exists, there cannot exist nodes of type (1), (2) and (3').
This also holds for $\mathcal{Y}_1$ and $\mathcal{Y}_2$.

Proof: Given that a node $z_0$ of type (1) exists in $X$, we will show by contradiction that none of
the above three types of nodes can exist. Suppose a node $z_1$ of type (1') exists; then
$(z_1, x_l^{\alpha_1}) \in R \Rightarrow (z_1, x_l^{\alpha_1}) \in R_{\alpha_1}$. Since $z_0$ is of type (1),
$(x_l^{\alpha_1}, z_0) \in R \Rightarrow (x_l^{\alpha_1}, z_0) \in R_{\alpha_1}$. By
the transitivity of $R_{\alpha_1}$, it follows that $(z_1, z_0) \in R_{\alpha_1}$. Also, since $z_0$ is of type (1), we have
$(z_0, x_l^{\alpha_2}) \in R \Rightarrow (z_0, x_l^{\alpha_2}) \in R_{\alpha_2}$. Likewise, $z_1$ being of type (1') also says
$(x_l^{\alpha_2}, z_1) \in R \Rightarrow (x_l^{\alpha_2}, z_1) \in R_{\alpha_2}$. Hence the transitivity of $R_{\alpha_2}$ tells us that
$(z_0, z_1) \in R_{\alpha_2}$. So we now
have the same pair of nodes being connected in opposite ways in $R_{\alpha_1}$ and $R_{\alpha_2}$. This contradicts
condition (2) of combining $R_{\alpha_1}$ and $R_{\alpha_2}$, namely that they share the same partial order on $x_1 x_2 \ldots x_{l-1}$.

Suppose a node $z_1$ of type (2') exists; then $(z_1, x_l^{\alpha_1}) \in R \Rightarrow (z_1, x_l^{\alpha_1}) \in R_{\alpha_1}$. Also
$(x_l^{\alpha_1}, z_0) \in R_{\alpha_1}$. Transitivity of $R_{\alpha_1}$ tells us $(z_1, z_0) \in R_{\alpha_1}$. Since both $z_0$ and $z_1$ belong to
$x_1 x_2 \ldots x_{l-1}$, $(z_1, z_0) \in R_{\alpha_2}$. We also have $(z_0, x_l^{\alpha_2}) \in R \Rightarrow (z_0, x_l^{\alpha_2}) \in R_{\alpha_2}$. Transitivity in
$R_{\alpha_2}$ now implies $(z_1, x_l^{\alpha_2}) \in R_{\alpha_2}$, and hence this edge is in $R$. But this edge must be absent as $z_1$ is of
type (2'). A similar contradiction arises for a node of type (3).

The second statement of the lemma has a proof analogous to that of the first statement.

We will now show that this efficient procedure generates the correct relations.

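Theorem 1: The generated relations (among the three possible combinations $\mathcal{Y}_0$, $\mathcal{Y}_1$ and $\mathcal{Y}_2$)
as per Algorithm 3 are all transitively closed, and the ones not generated violate transitivity.

Proof: Let us list out the six possibilities that need to be checked for proving transitivity
(because $\alpha_1$ and $\alpha_2$ share the same subepisode on dropping their respective last nodes, as
already discussed in the candidate generation section). Let $z'$ denote an element belonging to $X$.

(a) $(z', x_l^{\alpha_1}), (x_l^{\alpha_1}, x_l^{\alpha_2}) \in \mathcal{Y}_i \Rightarrow (z', x_l^{\alpha_2}) \in \mathcal{Y}_i$

(b) $(z', x_l^{\alpha_2}), (x_l^{\alpha_2}, x_l^{\alpha_1}) \in \mathcal{Y}_i \Rightarrow (z', x_l^{\alpha_1}) \in \mathcal{Y}_i$

(c) $(x_l^{\alpha_1}, z'), (z', x_l^{\alpha_2}) \in \mathcal{Y}_i \Rightarrow (x_l^{\alpha_1}, x_l^{\alpha_2}) \in \mathcal{Y}_i$

(d) $(x_l^{\alpha_2}, z'), (z', x_l^{\alpha_1}) \in \mathcal{Y}_i \Rightarrow (x_l^{\alpha_2}, x_l^{\alpha_1}) \in \mathcal{Y}_i$

(e) $(x_l^{\alpha_1}, x_l^{\alpha_2}), (x_l^{\alpha_2}, z') \in \mathcal{Y}_i \Rightarrow (x_l^{\alpha_1}, z') \in \mathcal{Y}_i$

(f) $(x_l^{\alpha_2}, x_l^{\alpha_1}), (x_l^{\alpha_1}, z') \in \mathcal{Y}_i \Rightarrow (x_l^{\alpha_2}, z') \in \mathcal{Y}_i$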

We do these six transitivity checks on a case-by-case basis, as adopted by the procedure.

Case (i) A node $z$ of type (1) exists in $X$: Here, we need to show that $\mathcal{Y}_0$ and $\mathcal{Y}_2$ are not transitively
closed and $\mathcal{Y}_1$ is indeed closed.

Since $z$ is of type (1), $(x_l^{\alpha_1}, z)$ and $(z, x_l^{\alpha_2})$ are present in both $\mathcal{Y}_0$ and $\mathcal{Y}_2$. But transitivity
demands the edge $(x_l^{\alpha_1}, x_l^{\alpha_2})$, which is absent in both $\mathcal{Y}_0$ and $\mathcal{Y}_2$. Hence both of them are not
closed.

To prove the transitive closedness of $\mathcal{Y}_1$, let us perform the six checks listed above. If the hypothesis
of (a) is true, and suppose $(z', x_l^{\alpha_2}) \notin \mathcal{Y}_1$, then either there exists an edge $(x_l^{\alpha_2}, z') \in \mathcal{Y}_1$ or
there exists no edge between $z'$ and $x_l^{\alpha_2}$. In the first case $z'$ must be of type (1'), which cannot
exist from Lemma 1. In the second case $z'$ must be of type (2'), which also cannot exist from
Lemma 1. This proves (a). The hypotheses of (b) and (f) cannot be true in $\mathcal{Y}_1$ because of the reverse
edge $(x_l^{\alpha_1}, x_l^{\alpha_2})$. (c) is obviously true in $\mathcal{Y}_1$. The hypothesis of (d) indicates the existence of a
type (1') node in $X$, which is not possible from Lemma 1. Correctness of (e) is similar to that
of (a). If the hypothesis of (e) is true, and suppose $(x_l^{\alpha_1}, z') \notin \mathcal{Y}_1$, then either there exists an edge
$(z', x_l^{\alpha_1}) \in \mathcal{Y}_1$ or there exists no edge between $z'$ and $x_l^{\alpha_1}$. In the first case $z'$ must be of type
(1'), which cannot exist from Lemma 1. In the second case $z'$ must be of type (3), which also
cannot exist from Lemma 1. This proves (e).

Case (ii) A node of type (1') exists in $X$: This is analogous to case (i).

Case (iii) Neither a node of type (1) nor type (1') exists: First we need to show that $\mathcal{Y}_0$ is
always closed here. The hypotheses of (a), (b), (e) and (f) are never true as they involve a direct
edge between $x_l^{\alpha_1}$ and $x_l^{\alpha_2}$, which is not present in $\mathcal{Y}_0$. The hypotheses of (c) and (d) demand the
existence of nodes of type (1) and (1') respectively, which do not arise in this scenario. This shows
the transitivity of $\mathcal{Y}_0$ in this case.

Further, we show that $\mathcal{Y}_1$ is closed iff no nodes of type (2') and (3) exist in $X$.
($\Rightarrow$) Let us prove the contrapositive of the forward statement. If a node $z'$ of type (2') exists,
then we have $(z', x_l^{\alpha_1}), (x_l^{\alpha_1}, x_l^{\alpha_2}) \in \mathcal{Y}_1$ but there is no edge between $z'$ and $x_l^{\alpha_2}$. This violates
transitivity of $\mathcal{Y}_1$. Similarly, if a node $z'$ of type (3) exists, then we have $(x_l^{\alpha_1}, x_l^{\alpha_2}), (x_l^{\alpha_2}, z') \in \mathcal{Y}_1$
but there is no edge between $z'$ and $x_l^{\alpha_1}$. This violates transitivity of $\mathcal{Y}_1$.

($\Leftarrow$) Suppose no nodes of type (2') and (3) exist; we will show the closedness of $\mathcal{Y}_1$. If the
hypothesis of (a) is true, and suppose $(z', x_l^{\alpha_2}) \notin \mathcal{Y}_1$, then either there exists an edge $(x_l^{\alpha_2}, z') \in
\mathcal{Y}_1$ or there exists no edge between $z'$ and $x_l^{\alpha_2}$. In the first case $z'$ must be of type (1'), which
cannot exist here in case (iii). In the second case $z'$ must be of type (2'), which also cannot exist
from the hypothesis. This proves (a). The hypotheses of (b) and (f) are not satisfied here as they
involve the edge $(x_l^{\alpha_2}, x_l^{\alpha_1})$, which is absent in $\mathcal{Y}_1$. The hypotheses of (c) and (d) demand the existence of nodes of
type (1) and (1'), which cannot exist here in case (iii). If the hypothesis of (e) is true, and suppose
$(x_l^{\alpha_1}, z') \notin \mathcal{Y}_1$, then either there exists an edge $(z', x_l^{\alpha_1}) \in \mathcal{Y}_1$ or there exists no edge between
$z'$ and $x_l^{\alpha_1}$. In the first case $z'$ must be of type (1'), which cannot exist here in case (iii). In the
second case $z'$ must be of type (3), which also cannot exist from the hypothesis. This proves (e).

Further, we show that $\mathcal{Y}_2$ is closed iff no nodes of type (2) and (3') exist in $X$. The proof of
this is analogous to that of $\mathcal{Y}_1$. ■

Algorithm 3: GetPotentialCandidates($\alpha_1$, $\alpha_2$)



Input: Patterns α1 and α2, both of size l
Output: P, candidate possibilities from α1 and α2

 1  Initialize flg, flg1, flg2 ← 0 and P ← ∅;
 2  for (i ← 1; i ≤ l − 1 and flg = 0; i ← i + 1) do
 3      if (α1.e[l][i] = 1 and α2.e[i][l] = 1) then flg ← 1;
 4      if (α2.e[l][i] = 1 and α1.e[i][l] = 1) then flg ← 2;
 5  if flg ≠ 0 then
 6      γ1 ← SimpleJoin(α1, α2);
 7      if flg = 1 then γ1.e[l][l + 1] ← 1;
 8      else γ1.e[l + 1][l] ← 1;
 9      Add γ1 to P;
10      return P;
11  else
12      γ1 ← SimpleJoin(α1, α2);
13      Add γ1 to P;
14      for i ← 1 to l − 1 do
15          if (α1.e[l][i] = 1 and α2.e[l][i] = 0) or
16             (α1.e[i][l] = 0 and α2.e[i][l] = 1) then
17              flg1 = 1;
18          if (α1.e[l][i] = 0 and α2.e[l][i] = 1) or
19             (α1.e[i][l] = 1 and α2.e[i][l] = 0) then
20              flg2 = 1;
21      if flg1 = 0 and flg2 = 0 then
22          γ2 ← SimpleJoin(α1, α2);
23          γ2.e[l][l + 1] ← 1;
24          Add γ2 to P;
25          γ3 ← SimpleJoin(α1, α2);
26          γ3.e[l + 1][l] ← 1;
27          Add γ3 to P;
28      if flg1 = 1 and flg2 = 0 then
29          γ2 ← SimpleJoin(α1, α2);
30          γ2.e[l][l + 1] ← 1;
31          Add γ2 to P;
32      if flg1 = 0 and flg2 = 1 then
33          γ3 ← SimpleJoin(α1, α2);
34          γ3.e[l + 1][l] ← 1;
35          Add γ3 to P;
36      return P;
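For readers who prefer executable code, the following is a minimal Python sketch of Algorithm 3 under a few assumptions: episodes are given as 0-indexed adjacency matrices (`e[u][v] == 1` iff $(u, v)$ is in the partial order), the first $l - 1$ rows and columns of both inputs describe the common nodes, and `simple_join` is a stand-in for SimpleJoin() that places the last node of the first episode at index $l - 1$ and the last node of the second at index $l$. Candidates are returned in the order $\mathcal{Y}_0$, $\mathcal{Y}_1$, $\mathcal{Y}_2$.

```python
def simple_join(e1, e2):
    """Y_0: the (l+1)-node matrix that keeps every edge of both episodes and
    puts no edge between their last nodes (indices l-1 and l)."""
    l = len(e1)
    g = [[0] * (l + 1) for _ in range(l + 1)]
    for i in range(l):
        for j in range(l):
            g[i][j] = e1[i][j]
    for i in range(l - 1):
        g[i][l] = e2[i][l - 1]
        g[l][i] = e2[l - 1][i]
    return g

def get_potential_candidates(e1, e2):
    l = len(e1)
    last = l - 1       # index of each episode's own last node
    x1, x2 = l - 1, l  # indices of the two last nodes inside the join
    common = range(l - 1)

    def with_edge(u, v):
        g = simple_join(e1, e2)
        g[u][v] = 1
        return g

    # Lines 2-10: a node of type (1) or (1') forces a single candidate.
    for i in common:
        if e1[last][i] and e2[i][last]:
            return [with_edge(x1, x2)]  # Y_1 only
        if e2[last][i] and e1[i][last]:
            return [with_edge(x2, x1)]  # Y_2 only

    # Lines 12-13: Y_0 is always a candidate in the remaining case.
    candidates = [simple_join(e1, e2)]
    # Lines 14-20: flg1 flags nodes of type (2) or (3'), flg2 of type (3) or (2').
    flg1 = any((e1[last][i] and not e2[last][i]) or
               (not e1[i][last] and e2[i][last]) for i in common)
    flg2 = any((not e1[last][i] and e2[last][i]) or
               (e1[i][last] and not e2[i][last]) for i in common)
    # Lines 21-35, folded: Y_1 needs flg2 = 0 and Y_2 needs flg1 = 0.
    if not flg2:
        candidates.append(with_edge(x1, x2))
    if not flg1:
        candidates.append(with_edge(x2, x1))
    return candidates

# (A -> B) joined with (A -> C): A -> (B C), A -> B -> C and A -> C -> B.
print(len(get_potential_candidates([[0, 1], [0, 0]], [[0, 1], [0, 0]])))  # 3
```

The three conditionals on lines 21 to 35 of the listing are folded into two independent tests here; the generated set is the same, since $\mathcal{Y}_1$ is added exactly when flg2 is 0 and $\mathcal{Y}_2$ exactly when flg1 is 0.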



